ErrorTracker

Differences between revisions 1 and 79 (spanning 78 versions)
Revision 1 as of 2011-07-04 17:12:06
Size: 235
Editor: mpt
Comment: + Prior art
Revision 79 as of 2012-07-12 14:28:50
Size: 20873
Editor: mpt
Comment: + "An application is hung for at least 5 seconds after you try to close its window"
== Prior art ==
## page was renamed from CrashTracker

{{attachment:mac-app.png}} {{attachment:mac-plugin.png}} {{attachment:mac-os.png}} {{attachment:mac-hang.png}}
<<TableOfContents()>>

{{attachment:windows-app-progress.gif}} {{attachment:windows-app.gif}} {{attachment:windows-os.png}}

== Rationale ==

To help Ubuntu reach a standard of quality similar to competing operating systems, developers need to know the answers to two questions:

 1. '''How reliable is Ubuntu right now?''' (Compared with yesterday, compared with the previous version, or compared with what it would be if everyone had installed every update.)

 2. '''What’s the best thing I can do right now to help improve its quality?'''

We can better answer both of those questions if we collect '''all the information we need''', for as '''many types of problems''' as we can, from '''a large representative sample''' of people. This means not requiring people to sign in to any Web site, enter any text, submit hundreds of megabytes of data, receive e-mail, or do anything more complicated than clicking a button. It means collecting problem reports both before and after release. And it means analyzing and bucketing problems automatically, with developers able to configure the system to automatically retrieve more information about a particular kind of problem when it next occurs.

Statistics collected by Microsoft show that a bug reported by their Windows Error Reporting system “is 4.5 to 5.1 times more likely to be fixed than a bug reported directly by a human”, that fixing the right 1 percent of bugs addresses 50 percent of customer issues, and that fixing 20 percent of bugs addresses 80 percent of customer issues.

The client interface for the error tracker also serves a purpose which is less important for developers, but ''more'' important for end users: '''explaining why something weird just happened'''. In previous Ubuntu releases, when most programs crashed there was no explanation of why the window had disappeared.

== Client design ==

<<Anchor(settings)>>
=== Privacy settings ===

The “System” section of System Settings should have a “Privacy” panel with a padlock as its icon. (Eventually this panel may expand to include security settings as well, such as screen locking and disk encryption.)

{{attachment:privacy-settings.png}}

The “People using this computer can:” and following controls should be insensitive whenever you have not unlocked them as an administrator.

In a new Ubuntu installation (or an upgrade to a version that introduces these settings), “Send error reports to Canonical” should be checked by default, and “Send occasional system information to Canonical” should be unchecked.

<<Anchor(error)>>
=== When there is an error ===

When there is an error, an alert should appear with text and buttons depending on the situation. The Esc and Enter keys should ''not'' do anything in these alerts, because you may have been just about to press one of those in the program that has the problem.

|| ||<v>'''You are an admin, or error reporting is allowed'''||<v>'''Your admin has blocked error reporting'''||<v>'''Implemented in Ubuntu'''||
||<^><<Anchor(os-crash)>>[[#os-crash|#]]'''An OS package crashes''' for the first time this version<<BR>>,,Test case: sudo pkill -SEGV zeitgeist,,||<^>{{attachment:os-error-reportable.png}}||<^>{{attachment:os-error-unreportable.png}}||12.04||
||<^>'''An OS package crashes''' a subsequent time||<^>{{attachment:os-error-reportable-subsequent.png}}||<^>{{attachment:os-error-unreportable-subsequent.png}}||12.04||
||<-3 style="border:none;">“Ignore future problems of this type” means ignore future crashes of the same version of the same package.||
||<^><<Anchor(thread)>>[[#thread|#]]'''An application thread crashes''' for the first time this version||<^>{{attachment:app-thread-reportable.png}}||<(|2>(no alert shown)||<(|2>(in 12.04, shows “closed unexpectedly” error instead?)||
||<^>'''An application thread crashes''' a subsequent time||<^>{{attachment:app-thread-reportable-subsequent.png}}||
||<-3 style="border:none">For most other error types, the alert shouldn’t offer to be silent next time — because it still needs to appear to explain what’s happened, and (in the application hang case) to let you stop/relaunch the application:||
||<^><<Anchor(app-requested)>>[[#app-requested|#]]'''An application has a developer-specified error'''||<^>{{attachment:app-requested-reportable.png}}||<^>{{attachment:app-requested-unreportable.png}}||
||<^(|2><<Anchor(app-hang)>>[[#app-hang|#]]'''An application is hung for at least 30 seconds'''<<BR>>,,Test case: eog & sleep 5 && pkill -STOP eog && sleep 20 && pkill -CONT eog then wait for 30 seconds,,||<^>{{attachment:app-hang-reportable.png}}||<^>{{attachment:app-hang-unreportable.png}}||<|2>(targeted for 12.10)||
||<^-2>This alert should be modal to the unresponsive window. (It appears much later than, because it is more intrusive than, the greying-out of the window after 5 seconds.) If that window becomes responsive again, the alert’s contents should become insensitive for one second (to ignore misclicks) before the alert closes.||
||<^(|2><<Anchor(close-hang)>>[[#close-hang|#]]'''An application is hung for at least 5 seconds after you try to close its window'''<<BR>>,,Test case: eog & sleep 5 && pkill -STOP eog & pkill -TERM eog && sleep 5,,||<^>{{attachment:close-hang-reportable.png}}||<^>{{attachment:close-hang-unreportable.png}}||<|2>(targeted for 12.10)||
||<^-2>This alert should be modal to the unresponsive window, and should therefore have no title of its own. If that parent window becomes responsive again, the alert’s contents should become insensitive for one second (to ignore misclicks) before the alert and the window both close.||
||<^><<Anchor(app-crash)>>[[#app-crash|#]]'''An application crashes'''<<BR>>,,Test case: eog & pkill -SEGV eog,,||<^>{{attachment:app-crash-reportable.png}}||<^>{{attachment:app-crash-unreportable.png}}||12.04||
||<^><<Anchor(kernel-oops)>>[[#kernel-oops|#]]'''Ubuntu restarts after a kernel oops'''||<^>{{attachment:kernel-oops-reportable.png}}||<^>{{attachment:kernel-oops-unreportable.png}}||(targeted for 12.04 SRU)||
||<^><<Anchor(install-fails)>>[[#install-fails|#]]'''A package fails to install or update'''||<^>{{attachment:package-error-reportable.png}}||<^>{{attachment:package-error-unreportable.png}}||(targeted for 12.04 SRU)||
||<-3 style="border:none">With non-application software crashing, we can’t tell programmatically whether it’s something you need to care about or not. So if you aren’t going to report the errors, we might as well let you ignore future errors:||
||<^><<Anchor(non-app-crash)>>[[#non-app-crash|#]]'''Third-party non-application software crashes''' for the first time this version<<BR>>,,Test case: sh -c 'kill -SEGV $$',,||<^>{{attachment:nas-crash-reportable.png}}||<(|2>(no alert shown)||12.04||
||<^>'''Third-party non-application software crashes''' a subsequent time||<^>{{attachment:nas-crash-reportable-subsequent.png}}||12.04||
||<-3 style="border:none">For all cases where the “Send an error report to help fix this problem” checkbox is present, its state should persist across errors and across Ubuntu sessions.||
||<^ style="border:none"><<Anchor(details)>>If you choose “Show Details”, it should change to “Hide Details” while a text field containing the error report appears below the primary text.<<BR>><<BR>>If necessary, a spinner and the text “Collecting information…” should appear centered inside the text field while the information is collected (other than the process name and version, which should appear instantly), pausing whenever the collection system is waiting for you to answer any questions.||{{attachment:app-crash-reportable-details.png}}||
||<-3 style="border:none">If you choose to send an error report, the alert should disappear immediately. Data should be collected (if it hasn’t been already), and reports should be sent in the background, with ''no'' progress or success/failure feedback. If you are not connected to the Internet at the time, reports should be queued. Any queued reports should be sent when you next agree to send an error report while online.||
||<^ style="border:none">If you are using a pre-release version of Ubuntu, and the error report matches an existing Launchpad bug report, a further alert box should appear explaining its status and letting you open the bug report.||{{attachment:bug-report.png}}<<BR>>,,Enter = “OK”,,|| ||(targeted for 12.10)||

'''''Future work:''' Ensure that if there is a delay in displaying a crash, we adjust the text of the dialog to reflect this. For example, this applies when X crashes and the user has to log in again or restart the computer.''

'''''Future work:''' If a software update is known to fix the problem, replace the primary alert with [[SoftwareUpdates#alert|the software update alert]] (or progress window, depending on the update policy), with customized primary text. Or point them at a web page (not a wiki page!) with details if a workaround exists, but no fix is available yet.''

'''''Future work:''' Automate the communication with the user to facilitate things like leak detection in subsequent runs, without requiring additional interaction with the user. Our current process requires us to ask people who are subscribed to the bug to try a specially-instrumented build, with a traditionally very long feedback loop between the developer and the bug subscribers. We should make it entirely automatic. Just wait for the next user who sees the bug to click one "yes, I'd like to help make this product better" button.''

<<Anchor(multiple)>>
=== When there are multiple simultaneous errors ===

To guard against the case where multiple errors of the same type cause a flood of alert boxes, there should be '''aggregate alert boxes''' for the two most likely cases, internal errors and application crashes.

If an alert box for a single error is open and unfocused, when another error of the same type happens, that alert box should morph into the aggregate version.

||<^>'''Multiple OS packages crash'''||<^>{{attachment:os-error-reportable-multiple.png}}|| ||(?)||
||<^>'''Multiple applications crash'''||<^>{{attachment:app-crash-reportable-multiple.png}}|| ||(?)||

In these cases, the “Show Details” box should show details of all the errors, with a separator between them.

<<Anchor(debconf)>>
=== When there is a Debconf prompt ===

Debconf prompts for user-installed software in Ubuntu are, overwhelmingly, programming mistakes. Therefore, they should be presented as error alerts.

||<(^|2>[[#debconf|#]]'''A Debconf prompt'''||<^>{{attachment:debconf-reportable.png}}||<^>{{attachment:debconf-unreportable.png}}||(targeted for 12.10 — [[/Contributing/Debconf|how to contribute]])||

If the `TITLE` command is used, that string should be used as the title of the window instead of “Ubuntu”.

If the program is not an application, the primary text should begin with ‘The program “{package name}” has a problem…’

If the Debconf prompt occurs during [[http://www.debian.org/doc/debian-policy/ch-maintainerscripts.html#s-removedetails|postinst abort-remove]], the primary text should end with “…finish its removal.”

Controls should be included in the alert depending on the type of prompt.

||<tablestyle="float:left;"><<Anchor(debconf-string)>>[[#debconf-string|#]]'''Type “string”'''||
||{{attachment:debconf-string.png}}||

||<tablestyle="float:left;"><<Anchor(debconf-boolean)>>[[#debconf-boolean|#]]'''Type “boolean”'''||
||{{attachment:debconf-boolean.png}}||

||<-2 tablestyle="float:left;"><<Anchor(debconf-select)>>[[#debconf-select|#]]'''Type “select”'''||
||If there are six or fewer choices:||If there are more than six choices:||
||{{attachment:debconf-select-few.png}}||{{attachment:debconf-select-many.png}}||

||<-2 tablestyle="float:left;"><<Anchor(debconf-multiselect)>>[[#debconf-multiselect|#]]'''Type “multiselect”'''||
||If there are six or fewer choices:||If there are more than six choices:||
||{{attachment:debconf-multiselect-few.png}}||{{attachment:debconf-multiselect-many.png}}||

||<tablestyle="float:left;"><<Anchor(debconf-note)>>[[#debconf-note|#]]'''Type “note”'''||
||{{attachment:debconf-note.png}}||

||<tablestyle="float:left;"><<Anchor(debconf-text)>>[[#debconf-text|#]]'''Type “text”'''||
||{{attachment:debconf-text.png}}||

||<tablestyle="float:left;"><<Anchor(debconf-password)>>[[#debconf-password|#]]'''Type “password”'''||
||{{attachment:debconf-password.png}}||

||<tablestyle="clear:both;" style="border:none;">||

<<Anchor(metrics)>>
=== Invitation for metrics collection ===

The ''first'' time (and only the first time) that an administrator responds to an error alert, a second alert should appear to invite them to opt in to metrics collection. (The “Esc” key should activate “Don’t Send” in this alert, but the “Enter” key should not do anything.)

{{attachment:privacy-settings-alert.png}}

The “Privacy…” button should open System Settings to the Privacy panel. Choosing “Send” should be equivalent to checking “Send occasional system information to Canonical” in the Privacy settings.

<<Anchor(debconf-progress)>>
=== Presenting debconf progress ===

''This has nothing to do with error tracking, but is included here because the error tracker provides all the rest of debconf’s graphical UI.''

When a maintainer script requests progress presentation (`db_progress`), the progress text and proportion should be shown in a progress window. Ideally, this progress window should morph to and/or from any consecutive debconf prompts.

{{attachment:debconf-progress.png}}

The window title should be of the form ‘Installing “package-name”’, ‘Reinstalling “package-name”’, ‘Updating “package-name”’, or ‘Removing “package-name”’ as appropriate, or “Ubuntu” if the type of operation is unknown. The “Skip” button should be present if the operation is skippable.

== Client implementation ==

The apport client will write a .upload file alongside a .crash file to indicate that the crash should be sent to the crash database. A small C daemon (currently reporterd) will set up an inotify watch on the /var/crash directory, and whenever one of these .upload files appears, it will upload the corresponding .crash file. It will do this only if there is an active Internet connection, as determined by watching the NetworkManager D-Bus API for connectivity events; otherwise it will add the report to a queue for later processing.
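
The per-file decision described above can be sketched as follows. This is only an illustration in Python, not the C daemon itself: it scans the directory once instead of using inotify, and the `upload` callback and `queue` list are hypothetical stand-ins for the real upload and queueing code.

```python
import os


def pending_crashes(crash_dir):
    """Return the .crash paths in crash_dir that have a matching .upload marker."""
    pending = []
    for name in sorted(os.listdir(crash_dir)):
        base, ext = os.path.splitext(name)
        if ext != ".upload":
            continue
        crash_path = os.path.join(crash_dir, base + ".crash")
        if os.path.exists(crash_path):
            pending.append(crash_path)
    return pending


def process_crashes(crash_dir, online, upload, queue):
    """Upload each marked crash when online; otherwise queue it for later."""
    for crash_path in pending_crashes(crash_dir):
        if online:
            upload(crash_path)
        elif crash_path not in queue:
            # Hypothetical in-memory queue; the real daemon may persist it.
            queue.append(crash_path)
```

A crash file without a .upload marker is never touched, which matches the rule that only reports the user agreed to send are uploaded.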

We will ensure NetworkManager brings up the interfaces as early as possible, to enable us to file crash reports during boot.

This needs to be a daemon, rather than another path of the apport client code, to account for there not being an Internet connection at the time of the crash and for crashes during boot, when we cannot assume the user will get back to a known-good state to file the report.

The canonical example here is the scenario posed in Microsoft’s Windows Error Reporting paper, where a piece of malware was causing the core desktop application (explorer.exe) to crash. They were still able to receive crash reports, as their client software still submitted reports very early on in the boot process.

The apport crash file will be parsed into an intermediate data structure (currently a GHashTable), with the core dump stripped out, and then converted into BSON to be transmitted in an HTTP POST operation. The server will reply with a UUID for subsequent operations and, optionally, a command for further action. Initially, this will just be a command to upload the core dump.
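
A rough Python sketch of that pipeline follows. It assumes a simplified version of the apport report format ("Key: value" lines, with continuation lines indented by one space), uses a plain dictionary in place of the GHashTable, and encodes only string fields. The function names are hypothetical, and the HTTP POST itself is left out.

```python
import struct


def parse_report(text):
    """Parse 'Key: value' lines; lines indented by a space continue the last key."""
    report = {}
    key = None
    for line in text.splitlines():
        if line.startswith(" ") and key is not None:
            # Continuation of a multi-line value.
            previous = report[key]
            report[key] = previous + "\n" + line[1:] if previous else line[1:]
        elif ":" in line:
            key, _, value = line.partition(":")
            report[key] = value.strip()
    return report


def strip_core_dump(report):
    """Return a copy of the report without the CoreDump field."""
    return {k: v for k, v in report.items() if k != "CoreDump"}


def encode_bson(report):
    """Encode a flat dict of strings as a BSON document (string elements only)."""
    body = b""
    for key, value in report.items():
        data = value.encode("utf-8")
        # Element: type 0x02 (string), NUL-terminated key,
        # int32 length (including the trailing NUL), then the bytes and a NUL.
        body += b"\x02" + key.encode("utf-8") + b"\x00"
        body += struct.pack("<i", len(data) + 1) + data + b"\x00"
    # Document: int32 total length, the elements, and a terminating NUL.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"
```

The bytes returned by `encode_bson(strip_core_dump(parse_report(text)))` would form the body of the POST request.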

A new field is being added to the apport crash file, StacktraceAddressSignature. The server will check for this field, and if it already has a retraced core dump generated from the same signature, it will reply with just the UUID of the crash report entry in the database, indicating that a core dump need not be submitted.
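
The server-side decision might look roughly like this in Python. The command name and the use of an in-memory set of already-retraced signatures are assumptions made for illustration; the real server would consult the crash database.

```python
import uuid

UPLOAD_CORE = "upload-core"  # hypothetical command name


def handle_submission(report, retraced_signatures):
    """Return (report_uuid, command) for a newly submitted report.

    command is None when a retraced core dump already exists for the
    report's StacktraceAddressSignature, so no core dump is needed.
    """
    report_uuid = str(uuid.uuid4())
    signature = report.get("StacktraceAddressSignature")
    if signature and signature in retraced_signatures:
        return report_uuid, None
    return report_uuid, UPLOAD_CORE
```

Reports that lack the signature field fall through to the core dump request, since the server has nothing to match them against.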

If, however, the server does reply with a request to upload the core dump, it will be sent as zlib-compressed data in an HTTP POST operation.

The URLs for posting will be of the form:
 - http://crashes.ubuntu.com/submit
 - http://crashes.ubuntu.com/550e8400-e29b-41d4-a716-446655440000/submit-core
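
As a small illustration of the two URL forms and the compression step, here is a Python sketch. The helper names are hypothetical, and the network request is left out.

```python
import zlib

BASE_URL = "http://crashes.ubuntu.com"


def submit_url():
    """URL for posting a new crash report."""
    return BASE_URL + "/submit"


def submit_core_url(report_uuid):
    """URL for posting the core dump of the report identified by report_uuid."""
    return "%s/%s/submit-core" % (BASE_URL, report_uuid)


def compress_core(core_bytes):
    """zlib-compress a core dump before it is sent as the POST body."""
    return zlib.compress(core_bytes)
```

The UUID passed to `submit_core_url` is the one the server returned in its reply to the initial report.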

Crash reports will be cleaned up after 14 days, as the system may never be connected to the Internet.
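
One way to find reports that are due for cleanup is sketched below in Python. It assumes that a report's age is measured from the file's modification time, which this page does not specify.

```python
import os
import time

MAX_AGE_SECONDS = 14 * 24 * 60 * 60  # 14 days


def expired_reports(crash_dir, now=None):
    """Return the .crash paths in crash_dir last modified more than 14 days ago."""
    now = time.time() if now is None else now
    expired = []
    for name in sorted(os.listdir(crash_dir)):
        if not name.endswith(".crash"):
            continue
        path = os.path.join(crash_dir, name)
        # Assumption: age is judged by modification time.
        if now - os.path.getmtime(path) > MAX_AGE_SECONDS:
            expired.append(path)
    return expired
```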

If the reporter daemon crashes, it will write a crash file like any other application. Its upstart job will have the respawn flag set, with a respawn limit in place so that it does not restart in a tight loop.

If the reporter daemon moves to using apport-unpack to process the crash files, it should gracefully handle -ENOSPC.

Crash reports for applications that are not themselves part of packages in Ubuntu will also be handled. These will not be retraced, but they will be collected for statistical analysis. This removes the "the problem cannot be reported" dialog in Apport.

We will add an Origin and possibly a Site field to the apport reports, using the python-apt candidate.origins interface. This will allow us to answer questions like what percentage of crashes are coming from PPAs. More importantly, it will let us focus reports on packages from a particular PPA, like the unity-testing one.
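
Once reports carry an Origin field, a question such as "what percentage of crashes come from PPAs" reduces to a simple count. The Python sketch below assumes that PPA origins begin with the prefix `LP-PPA-`; the function names and the sample origin value in the docstring are made up for illustration.

```python
PPA_PREFIX = "LP-PPA-"  # assumed prefix of the Origin value for PPA packages


def ppa_share(reports):
    """Return the percentage of reports whose Origin field points at a PPA."""
    if not reports:
        return 0.0
    from_ppa = sum(1 for r in reports if r.get("Origin", "").startswith(PPA_PREFIX))
    return 100.0 * from_ppa / len(reports)


def reports_from(reports, origin):
    """Return only the reports with the given Origin, e.g. 'LP-PPA-example-owner'."""
    return [r for r in reports if r.get("Origin") == origin]
```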

== Server design ==
[[/ServerArchitecture]] has additional details.

We will use Robert Collins’ oops-repository as the foundation for our crash database. It has been suggested that this can meet Launchpad’s crash reporting requirements, scaling to a high volume of reports (e.g. 1M/day). We will also use the OOPS dictionary format for our crashes.

This will make it easier to integrate with Launchpad’s longer-term plans to offer this as a service for all projects. Launchpad’s offering may be implemented as one big Cassandra cluster in a multi-tenant fashion, or on a per-project basis, feeding to an API.

oops-repository will also provide the API for interacting with the database. This will include operations to post a new crash and potentially ask for more information, upload additional information (such as the core dump), get the full data for a crash out (a privileged operation), and update an existing crash report (a partially privileged operation) with the retraced data.

We will build a small Django web user interface for management functions on top of this API. The initial implementation will not allow regular developers to access the crash data, as we will not have time in this cycle to address the security concerns around this. Canonical IS will be the interim arbiter of who is able to access this system, inclusive of at least the release manager.

=== Retracing ===

When a new core dump is submitted to the crash database, it will be written to a SAN and the UUID will be added to a RabbitMQ queue for the matching architecture. The queue will also be written in Cassandra, in the event the RabbitMQ service fails.

Retracing daemons for each architecture will pull UUIDs off their respective RabbitMQ queues, get the core dump for the UUID from Cassandra, then feed it through apport-retrace.

When a complete trace is generated, it will be added as a row in the crash column family for the relevant UUID. It will also be added to an index column family where the key is the crash signature (StacktraceAddressSignature) and the value is the UUID in the crash column family. In the future, we may expand this to a more complex bucketing algorithm, as necessary.
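
The flow of UUIDs through the per-architecture queues and the signature index can be modelled in a few lines of Python. In this sketch, plain deques and dictionaries stand in for the RabbitMQ queues and Cassandra column families, and the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class RetraceStore:
    """In-memory stand-in for the retracing queues and column families."""

    def __init__(self):
        self.queues = defaultdict(deque)  # architecture -> UUIDs awaiting retrace
        self.crashes = {}                 # UUID -> fields (crash column family)
        self.signature_index = {}         # StacktraceAddressSignature -> UUID

    def enqueue(self, architecture, report_uuid):
        """Queue a newly submitted core dump for the matching architecture."""
        self.queues[architecture].append(report_uuid)

    def next_for(self, architecture):
        """Return the next UUID for this architecture, or None if none is waiting."""
        queue = self.queues[architecture]
        return queue.popleft() if queue else None

    def record_trace(self, report_uuid, signature, stacktrace):
        """Store a completed trace and index it by its crash signature."""
        self.crashes.setdefault(report_uuid, {})["Stacktrace"] = stacktrace
        self.signature_index[signature] = report_uuid
```

Looking up a signature in `signature_index` is what lets the server skip the core dump request for crashes it has already retraced.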

The retracing daemon systems will each keep a large cache of the debug symbol packages.

Upon first successful connection to the Internet, the system will send a basic hardware profile, keyed on a SHA512 of the system UUID and a SHA512 of the DMI tables themselves.
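
Deriving the two hashes is straightforward. The Python sketch below assumes that the system UUID is available as text and the DMI tables as raw bytes; how the client obtains them is outside the scope of the example.

```python
import hashlib


def hardware_profile_key(system_uuid, dmi_tables):
    """Return (SHA-512 of the system UUID, SHA-512 of the DMI tables) as hex strings."""
    uuid_hash = hashlib.sha512(system_uuid.encode("utf-8")).hexdigest()
    dmi_hash = hashlib.sha512(dmi_tables).hexdigest()
    return uuid_hash, dmi_hash
```

Because only the hashes are sent, the server can recognise repeat submissions from the same machine without receiving the raw system UUID.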

This information will be submitted to one of the existing hardware databases. Queries will be possible across the crash database and hardware database. For example, it may be desirable to know what the top compiz crashes are for a particular piece of graphics hardware.

Android uses Google Feedback: http://android-developers.blogspot.com/2010/05/google-feedback-for-android.html

There is a Google project for cross-platform crash dump capturing.

There is a Django project called 'sentry' for web server error analysis (that has a Cassandra backend, I'm told).

The Launchpad SOA has an active discussion around their requirements at https://dev.launchpad.net/LEP/OopsDisplay. They plan to split out various crash report tools from Launchpad into reusable Python modules. It is unknown at this point if they'll be generic enough to fit Ubuntu's needs. Launchpad suspects this could fulfill needs for: Ubuntu One, Landscape, Canonical ISD (SSO etc.), Ubuntu; possibly also Drizzle, OpenERP, OpenStack.

Rationale

To help Ubuntu reach a standard of quality similar to competing operating systems, developers need to know the answers to two questions:

  1. How reliable is Ubuntu right now? (Compared with yesterday, compared with the previous version, or compared with what it would be if everyone had installed every update.)

  2. What’s the best thing I can do right now to help improve its quality?

We can better answer both of those questions if we collect all the information we need, for as many types of problems as we can, from a large representative sample of people. This means not requiring people to sign in to any Web site, enter any text, submit hundreds of megabytes of data, receive e-mail, or do anything more complicated than clicking a button. It means collecting problem reports both before and after release. And it means analyzing and bucketing problems automatically, with developers able to configure the system to automatically retrieve more information about a particular kind of problem when it next occurs.

Statistics collected by Microsoft show that a bug reported by their Windows Error Reporting system “is 4.5 to 5.1 times more likely to be fixed than a bug reported directly by a human”, that fixing the right 1 percent of bugs addresses 50 percent of customer issues, and that fixing 20 percent of bugs addresses 80 percent of customer issues.

The client interface for the error tracker also serves a purpose which is less important for developers, but more important for end users: explaining why something weird just happened. In previous Ubuntu release versions, when most programs crashed there was no explanation of why the window had disappeared.

Client design

Privacy settings

The “System” section of System Settings should have a “Privacy” panel with a padlock as its icon. (Eventually this panel may expand to include security settings as well, such as screen locking and disk encryption.)

privacy-settings.png

The “People using this computer can:” and following controls should be insensitive whenever you have not unlocked them as an administrator.

In a new Ubuntu installation (or an upgrade to a version that introduces these settings), “Send error reports to Canonical” should be checked by default, and “Send occasional system information to Canonical” should be unchecked.

When there is an error

When there is an error, an alert should appear with text and buttons depending on the situation. The Esc and Enter keys should not do anything in these alerts, because you may have been just about to press one of those in the program that has the problem.

You are an admin, or error reporting is allowed

Your admin has blocked error reporting

Implemented in Ubuntu

#An OS package crashes for the first time this version
Test case: sudo pkill -SEGV zeitgeist

12.04

An OS package crashes a subsequent time

12.04

“Ignore future problems of this type” means ignore future crashes of the same version of the same package.

#An application thread crashes for the first time this version

(no alert shown)

(in 12.04, shows “closed unexpectedly” error instead?)

An application thread crashes a subsequent time

For most other error types, the alert shouldn’t offer to be silent next time — because it still needs to appear to explain what’s happened, and (in the application hang case) to let you stop/relaunch the application:

#An application has a developer-specified error

#An application is hung for at least 30 seconds
Test case: eog & sleep 5 && pkill -STOP eog && sleep 20 && pkill -CONT eog then wait for 30 seconds

(targeted for 12.10)

This alert should be modal to the unresponsive window. (It appears much later than, because it is more intrusive than, the greying-out of the window after 5 seconds.) If that window becomes responsive again, the alert’s contents should become insensitive for one second (to ignore misclicks) before the alert closes.

#An application is hung for at least 5 seconds after you try to close its window
Test case: eog & sleep 5 && pkill -STOP eog & pkill -TERM eog && sleep 5

(targeted for 12.10)

This alert should be modal to the unresponsive window, and should therefore have no title of its own. If that parent window becomes responsive again, the alert’s contents should become insensitive for one second (to ignore misclicks) before the alert and the window both close.

#An application crashes
Test case: eog & pkill -SEGV eog

12.04

#Ubuntu restarts after a kernel oops

(targeted for 12.04 SRU)

#A package fails to install or update

(targeted for 12.04 SRU)

With non-application software crashing, we can’t tell programmatically whether it’s something you need to care about or not. So if you aren’t going to report the errors, we might as well let you ignore future errors:

#Third-party non-application software crashes for the first time this version
Test case: sh -c 'kill -SEGV $$'

(no alert shown)

12.04

Third-party non-application software crashes a subsequent time

12.04

For all cases where the “Send an error report to help fix this problem” checkbox is present, its state should persist across errors and across Ubuntu sessions.

If you choose “Show Details”, it should change to “Hide Details” while a text field containing the error report appears below the primary text.

If necessary, a spinner and the text “Collecting information…” should appear centered inside the text field while the information is collected (other than the process name and version, which should appear instantly), pausing whenever the collection system is waiting for you to answer any questions.

If you choose to send an error report, the alert should disappear immediately. Data should be collected (if it hasn’t been already), and reports should be sent in the background, with no progress or success/failure feedback. If you are not connected to the Internet at the time, reports should be queued. Any queued reports should be sent when you next agree to send an error report while online.

If you are using a pre-release version of Ubuntu, and the error report matches an existing Launchpad bug report, a further alert box should appear explaining its status and letting you open the bug report.

bug-report.png
Enter = “OK”

(targeted for 12.10)

Future work: Ensure that if there is a delay in displaying a crash, we adjust the text of the dialog to reflect this. As an example, if X crashes and the user has to log in again or reboot the computer.

Future work: If a software update is known to fix the problem, replace the primary alert with the software update alert (or progress window, depending on the update policy), with customized primary text. Or point them at a web page (not a wiki page!) with details if a workaround exists, but no fix is available yet.

Future work: Automate the communication with the user to facilitate things like leak detection in subsequent runs, without requiring additional interaction with the user. Our current process requires us to ask people who are subscribed to the bug to try a specially-instrumented build, with a traditionally very long feedback loop between the developer and the bug subscribers. We should make it entirely automatic. Just wait for the next user who sees the bug to click one "yes, I'd like to help make this product better" button.

When there are multiple simultaneous errors

To guard against the case where multiple errors of the same type cause a flood of alert boxes, there should be aggregate alert boxes for the two most likely cases, internal errors and application crashes.

If an alert box for a single error is open and unfocused, when another error of the same type happens, that alert box should morph into the aggregate version.

Multiple OS packages crash

(?)

Multiple applications crash

(?)

In these cases, the “Show Details” box should show details of all the errors, with a separator between them.

When there is a Debconf prompt

Debconf prompts for user-installed software in Ubuntu are, overwhelmingly, programming mistakes. Therefore, they should be presented as error alerts.

#A Debconf prompt

debconf-reportable.png

debconf-unreportable.png

(targeted for 12.10 — how to contribute)

If the TITLE command is used, that string should be used as the title of the window instead of “Ubuntu”.

If the program is not an application, the primary text should begin with ‘The program “{package name}” has a problem…’

If the Debconf prompt occurs during postinst abort-remove, the primary text should end with “…finish its removal.”.

Controls should be included in the alert depending on the type of prompt.

#Type “string”

debconf-string.png

#Type “boolean”

debconf-boolean.png

#Type “select”

If there are six or fewer choices:

If there are more than six choices:

debconf-select-few.png

debconf-select-many.png

#Type “multiselect”

If there are six or fewer choices:

If there are more than six choices:

debconf-multiselect-few.png

debconf-multiselect-many.png

#Type “note”

debconf-note.png

#Type “text”

debconf-text.png

#Type “password”

debconf-password.png

Invitation for metrics collection

For any administrator, after the first time only that they respond to an error alert, a second alert should appear to invite them to opt in to metrics collection. (The “Esc” key should activate “Don’t Send” in this alert, but the “Enter” key should not do anything.)

privacy-settings-alert.png

The “Privacy…” button should open System Settings to the Privacy panel. Choosing “Send” should be equivalent to checking “Send occasional system information to Canonical” in the Privacy settings.

Presenting debconf progress

This has nothing to do with error tracking, but is included here because the error tracker provides all the rest of debonf’s graphical UI.

When a maintainer script requests progress presentation (db_progress), the progress text and proportion should be shown in a progress window. Ideally, this progress window should morph to and/or from any consecutive debconf prompts.

debconf-progress.png

The window title should be of the form ‘Installing “package-name”’, ‘Reinstalling “package-name”’, ‘Updating “package-name”’, or “Removing ‘package-name’” as appropriate, or “Ubuntu” if the type of operation is unknown. The “Skip” button should be present if the operation is skippable.

Client implementation

The apport client will write a .upload file alongside a .crash file to indicate that the crash should be sent to the crash database. A small C daemon (currently reporterd) will set up an inotify watch on the /var/crash directory, and any time one of these .upload files appears, it will upload the .crash file. It will do this if and only if there is an active Internet connection, as determined by watching the NetworkManager DBus API for connectivity events, otherwise it will add it to a queue for later processing.

We will ensure NetworkManager brings up the interfaces as early as possible, to enable us to file crash reports during boot.

This needs to be a daemon, rather than another path of the apport client code, to account for there not being an Internet connection at the time of the crash and for crashes during boot, when we cannot assume the user will get back to a known-good state to file the report.

The canonical example here is the scenario posed in Microsoft’s Windows Error Reporting paper, where a piece of malware was causing the core desktop application (explorer.exe) to crash. They were still able to receive crash reports, as their client software still submitted reports very early on in the boot process.

The apport crash file will be parsed into an intermediate data structure (currently a GHashTable), with the core dump stripped out, and then converted into BSON to be transmitted in an HTTP POST operation. The server will reply with a UUID for subsequent operations and, optionally, a command for further action. Initially, this will just be a command to upload the core dump.
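
The parsing and stripping step can be sketched in Python as follows. This is a simplified illustration, not the daemon's code: it assumes a report made of "Key: value" lines where lines indented by one space continue the previous value, it assumes the core dump lives in a field named CoreDump, and it stops short of the BSON encoding, which needs a third-party library. The function names are hypothetical.

```python
def parse_crash_report(text):
    """Parse 'Key: value' lines; indented lines continue the previous value."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith(" ") and key is not None:
            # Continuation line: append to the current field.
            separator = "\n" if fields[key] else ""
            fields[key] += separator + line[1:]
        elif ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields


def strip_core_dump(fields):
    """Return a copy of the report without the (assumed) CoreDump field."""
    return {k: v for k, v in fields.items() if k != "CoreDump"}
```

The resulting dictionary plays the role of the intermediate data structure mentioned above and is what would be handed to a BSON encoder.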

A new field is being added to the apport crash file, StacktraceAddressSignature. The server will check for this field, and if it already has a retraced core dump generated from the same signature, it will reply with just the UUID of the crash report entry in the database, indicating that a core dump need not be submitted.
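
The server-side check might look roughly like the Python sketch below. Plain dictionaries stand in for the database, the reply is a dictionary, and the command name "upload-core" is made up for illustration. The sketch also assumes that the UUID returned is that of the newly stored report, which the text does not state explicitly.

```python
import uuid


def handle_report(report, crashes, retraced_by_signature):
    """Store a report and decide whether to ask for the core dump."""
    report_id = str(uuid.uuid4())
    crashes[report_id] = report
    signature = report.get("StacktraceAddressSignature")
    if signature is not None and signature in retraced_by_signature:
        # A retraced core dump already exists for this signature,
        # so reply with just the UUID.
        return {"uuid": report_id}
    # Hypothetical command name asking the client for the core dump.
    return {"uuid": report_id, "command": "upload-core"}
```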

If, however, the server does reply with a request to upload the core dump, it will be sent as zlib-compressed data in an HTTP POST operation.
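
Core dumps can be large, so one plausible approach is to compress them in chunks rather than in a single call. The Python sketch below uses the standard zlib module for this. The function name, the chunked interface, and the compression level are assumptions made for the example, not part of the design.

```python
import zlib


def compress_stream(chunks, level=6):
    """Yield zlib-compressed pieces for an iterable of byte chunks."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        data = compressor.compress(chunk)
        if data:
            yield data
    # Flush whatever the compressor is still buffering.
    yield compressor.flush()
```

Joining the yielded pieces gives a single zlib stream that the receiver can decompress with zlib.decompress.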

The URLs for posting will be of the form:

Crash reports will be cleaned up after 14 days, as the system may never be connected to the Internet.
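
A clean-up pass along these lines could be sketched in Python as shown below. It assumes that a report's age is measured by the file's modification time and that only .crash files are removed. Neither detail is specified here, and the function names are hypothetical.

```python
import os
import time

MAX_AGE_SECONDS = 14 * 24 * 60 * 60  # the 14 days mentioned above


def expired_reports(crash_dir, now=None, max_age=MAX_AGE_SECONDS):
    """Return .crash files whose modification time is older than max_age."""
    now = time.time() if now is None else now
    expired = []
    for name in sorted(os.listdir(crash_dir)):
        if not name.endswith(".crash"):
            continue
        path = os.path.join(crash_dir, name)
        if now - os.path.getmtime(path) > max_age:
            expired.append(path)
    return expired


def clean_up(crash_dir, now=None):
    """Delete expired reports and return the paths that were removed."""
    removed = expired_reports(crash_dir, now)
    for path in removed:
        os.remove(path)
    return removed
```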

If the reporter daemon crashes, it will write a crash file like any other application. Its upstart job will have the respawn flag set, with a respawn limit in place so that it is not restarted indefinitely.

If the reporter daemon moves to using apport-unpack to process the crash files, it should gracefully handle -ENOSPC.

Crash reports for applications that are not themselves part of packages in Ubuntu will be handled. These will not be retraced, but they will be collected for statistical analysis. This removes the "the problem cannot be reported" dialog in Apport.

We will add an Origin and possibly a Site field to the apport reports, using the python-apt candidate.origins interface. This will allow us to answer questions like what percentage of crashes are coming from PPAs. More importantly, it will let us focus reports on packages from a particular PPA, like the unity-testing one.
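
Once reports carry an Origin field, a question such as the PPA percentage reduces to a simple count. The Python sketch below assumes that reports are dictionaries and that PPA origins can be recognised by an "LP-PPA-" prefix. That prefix is an assumption made for this example, and the function name is hypothetical.

```python
PPA_PREFIX = "LP-PPA-"  # assumed marker for PPA origins


def ppa_crash_percentage(reports):
    """Return the percentage of reports whose Origin looks like a PPA."""
    if not reports:
        return 0.0
    from_ppa = sum(
        1 for report in reports
        if report.get("Origin", "").startswith(PPA_PREFIX)
    )
    return 100.0 * from_ppa / len(reports)
```

Filtering on one exact Origin value in the same way would focus on reports from a single PPA.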

Server design

/ServerArchitecture has additional details.

We will use Robert Collins’ oops-repository as the foundation for our crash database. It has been suggested that this can meet Launchpad’s crash reporting requirements, scaling to a high volume of reports (e.g. 1M/day). We will also use the OOPS dictionary format for our crashes.

This will make it easier to integrate with Launchpad’s longer-term plans to offer this as a service for all projects. Launchpad’s offering may be implemented as one big Cassandra cluster in a multi-tenant fashion, or on a per-project basis, feeding to an API.

oops-repository will also provide the API for interacting with the database. This will include operations to post a new crash and potentially ask for more information, upload additional information (such as the core dump), get the full data for a crash out (a privileged operation), and update an existing crash report (a partially privileged operation) with the retraced data.

We will build a small Django web user interface for management functions on top of this API. The initial implementation will not allow regular developers to access the crash data, as we will not have time in this cycle to address the security concerns around this. Canonical IS will be the interim arbiter of who is able to access this system, inclusive of at least the release manager.

Retracing

When a new core dump is submitted to the crash database, it will be written to a SAN and the UUID will be added to a RabbitMQ queue for the matching architecture. The queue will also be written in Cassandra, in the event the RabbitMQ service fails.

Retracing daemons for each architecture will pull UUIDs off their respective RabbitMQ queues, get the core dump for the UUID from Cassandra, then feed it through apport-retrace.
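
The per-architecture routing can be pictured with the Python sketch below. In-process queues from the standard library stand in for the RabbitMQ queues, so this shows only the routing idea and not the real messaging or the Cassandra fallback. The class and method names are hypothetical.

```python
import queue


class RetraceQueues:
    """One FIFO queue of report UUIDs per architecture."""

    def __init__(self, architectures):
        self._queues = {arch: queue.Queue() for arch in architectures}

    def submit(self, arch, report_id):
        """Enqueue a UUID for the retracer of the given architecture."""
        self._queues[arch].put(report_id)

    def next_for(self, arch):
        """Return the next UUID for this architecture, or None if idle."""
        try:
            return self._queues[arch].get_nowait()
        except queue.Empty:
            return None
```

A retracing daemon for one architecture would call next_for() in a loop, fetch the core dump for each UUID it receives, and pass it to apport-retrace.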

When a complete trace is generated, it will be added as a row in the crash column family for the relevant UUID. It will also be added to an index column family where the key is the crash signature (StacktraceAddressSignature) and the value is the UUID in the crash column family. In the future, we may expand this to a more complex bucketing algorithm, as necessary.
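
The two writes described above can be sketched in Python with dictionaries standing in for the two column families. The field names follow apport's, but the function name is hypothetical. Keeping the first UUID seen for a signature is an assumption, since the text does not say which UUID the index should hold when several reports share a signature.

```python
def store_retrace(crashes, signature_index, report_id, signature, stacktrace):
    """Record a retraced crash and index it by its signature."""
    row = crashes.setdefault(report_id, {})
    row["Stacktrace"] = stacktrace
    row["StacktraceAddressSignature"] = signature
    # Assumption: the first report retraced for a signature stays indexed.
    signature_index.setdefault(signature, report_id)
    return signature_index[signature]
```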

The retracing daemon systems will each keep a large cache of the debug symbol packages.

Upon first successful connection to the Internet, the system will send a basic hardware profile, keyed on a SHA512 of the system UUID and a SHA512 of the DMI tables themselves.
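
Deriving the two digests is straightforward with a standard hashing library. The Python sketch below assumes that the system UUID is available as a string and the DMI tables as raw bytes. How either is read from the machine, and how the two digests are combined into a key, is left open here. The function name is hypothetical.

```python
import hashlib


def hardware_profile_key(system_uuid, dmi_tables):
    """Return SHA-512 hex digests of the system UUID and the DMI tables."""
    uuid_digest = hashlib.sha512(system_uuid.encode("utf-8")).hexdigest()
    dmi_digest = hashlib.sha512(dmi_tables).hexdigest()
    return uuid_digest, dmi_digest
```

Because only the digests are sent, the raw system UUID never leaves the machine.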

This information will be submitted to one of the existing hardware databases. Queries will be possible across the crash database and hardware database. For example, it may be desirable to know what the top compiz crashes are for a particular piece of graphics hardware.

Android uses Google Feedback: http://android-developers.blogspot.com/2010/05/google-feedback-for-android.html

There is a Google project for cross-platform crash dump capturing.

There is a Django project called 'sentry' for web server error analysis (which, I'm told, has a Cassandra backend).

The Launchpad SOA has an active discussion around their requirements at https://dev.launchpad.net/LEP/OopsDisplay. They plan to split out various crash report tools from Launchpad into reusable Python modules. It is unknown at this point whether they will be generic enough to fit Ubuntu's needs. Launchpad suspects this could fulfill needs for: Ubuntu One, Landscape, Canonical ISD (SSO etc.), Ubuntu; possibly also Drizzle, OpenERP, OpenStack.

ErrorTracker (last edited 2018-02-27 11:56:11 by mpt)