ServerSideHooks

Revision 2 as of 2012-12-04 12:01:26


Rationale

What does a developer do when a stack trace is not enough to completely debug an issue? They could find a user who is experiencing this problem and contact them, asking them to provide additional information. This is time-consuming, though.

A developer should be able to identify an issue that needs additional information, write a small amount of code to collect this on a system exhibiting the problem, and quickly get that run on such systems. This code should require no human interaction and should report back quickly.

Security

Both daisy.ubuntu.com and errors.ubuntu.com use SSL certificates signed by Go Daddy. The Go Daddy CA certificate is included by default in the ca-certificates package.

  • Downloading seems safe.
  • Uploading
    • Mirror package uploading permissions to keep the same model we have today?
      • Query LP ACLs
      • Or we can keep it simple for now and restrict it to ~core-dev
  • Whoopsie needs to check the SSL certificate (strict server checking in libcurl)
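Whoopsie itself is written in C and uses libcurl, where strict server checking means enabling CURLOPT_SSL_VERIFYPEER and CURLOPT_SSL_VERIFYHOST. The equivalent check can be sketched in Python for illustration:

```python
# Illustrative sketch of the strict server-certificate check whoopsie
# needs. Whoopsie uses libcurl (CURLOPT_SSL_VERIFYPEER=1,
# CURLOPT_SSL_VERIFYHOST=2); this Python version shows the same two
# properties: chain verification and hostname verification.
import ssl

def make_strict_context():
    """Return an SSL context that verifies both the chain and the hostname."""
    context = ssl.create_default_context()    # loads the system CA bundle
    context.verify_mode = ssl.CERT_REQUIRED   # reject peers we cannot verify
    context.check_hostname = True             # reject hostname mismatches
    return context

context = make_strict_context()
```

Any client that disables either check is vulnerable to a man-in-the-middle attack when uploading reports.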

Peer review

We would like at least one other core developer to review each hook. When there is pressure to fix an issue urgently, it is easy to rush something out.

Non-interactive

  • How do we avoid password prompts?
    • One way would be a privileged apport dbus service that always services requests from the admin group.
    • Apport now uses pkexec rather than gksu.
  • We need to prevent hooks from running arbitrary code as root.
    • Not arbitrary code. We’re trusting the same group of developers as the archive (~core-dev).
      • attach_root_command maps to com.ubuntu.apport.package-hook
      • Martin will have a think about this.

Web interface

  • Need to sort out ~canonical not working with SSO in errors.ubuntu.com first (1073466)

  • Web UI restricted to ~core-dev.
  • Need UI for peer review system.

Types

  • Per-problem hook, keyed by the StacktraceAddressSignature (SAS). How do we map all the SASes to the signature?
  • Package level hook
  • We're purposefully leaving out global, problem type-specific hooks for now as these should ideally live in the apport package.

Keeping existing hooks for ubuntu-bug

  • Fix bug whereby existing hooks are running on released versions (1084979)

  • Keep daisy hooks on the apport-gtk path, run existing hooks on ubuntu-bug path
  • SRU whoopsie and apport for server-side hook changes

Expiry

  • quantity (if we received this 10 times, stop)
  • timeout (it is still running code, so after a week stop collecting)
  • Both quantity and timeout are editable fields with upper bounds
  • Non-editable disk-size upper bound (hooks can only include X MB)
  • Disable the hook once it hits the upper size bound. If the hook hits this while receiving data from the client, drop the connection and disable the hook.
  • We also need a disk upper bound on the client side, just in case. 50MB or so?
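The three limits above can be sketched as a single server-side check. All names and the default values here are illustrative, not the actual daisy implementation:

```python
# Hypothetical sketch of the expiry checks described above. The limit
# values are placeholders; quantity and timeout would be editable per
# hook (with upper bounds), while the size bound is fixed.
from datetime import datetime, timedelta

MAX_REPORTS = 10                  # quantity: stop after receiving this N times
MAX_AGE = timedelta(days=7)       # timeout: stop collecting after a week
MAX_BYTES = 50 * 1024 * 1024      # non-editable disk-size upper bound

def hook_expired(created, reports_received, bytes_received, now=None):
    """Return True once any of the three limits has been hit."""
    now = now or datetime.utcnow()
    if reports_received >= MAX_REPORTS:
        return True
    if now - created >= MAX_AGE:
        return True
    if bytes_received >= MAX_BYTES:
        return True
    return False
```

Once `hook_expired` returns True the hook is disabled; if the size bound is hit mid-upload, the connection is dropped as well.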

How do we get the reports with these fields?

  • They'll live in the HookOOPS column family (CF), which needs to refer back to the original OOPS.
  • We need to be able to surface these on the problem page
  • Can be done as an API first for expediency
  • Daisy and the client pass a token back and forth and that token gets mapped to the correct bucket ID.
  • We need a way of seeing just package hook results. Deferred discussion until Matthew gets here.
  • Hook this into the alerts system as well. Notification when a report comes in with that information.

Delivery mechanism

Provide a list (BSON) of URLs to the individual hooks. The client doesn't need the package name or the bucket ID because it already has those locally.

We send to the client with a token that maps to the bucket. In the simple case this would be the Crash Signature that the SAS maps to. We still need a SAS to Crash Signature mapping so we can tell the right clients to get the specific hook for a crash signature.
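The payload described above can be sketched as follows. The document specifies BSON; JSON is used here purely as a dependency-free stand-in, and the field names and token format are assumptions:

```python
# Sketch of the hook-delivery payload: a list of per-hook URLs plus a
# token that daisy later maps back to the correct bucket ID. BSON is
# the intended wire format; json is a stand-in for illustration.
import json

def build_hook_response(hook_urls, token):
    """Serialize the hook URL list and the bucket token for the client."""
    return json.dumps({'hooks': hook_urls, 'token': token})

def parse_hook_response(payload):
    """Client side: recover the hook URLs and the token to echo back."""
    data = json.loads(payload)
    return data['hooks'], data['token']

payload = build_hook_response(
    ['https://daisy.ubuntu.com/hooks/1', 'https://daisy.ubuntu.com/hooks/2'],
    token='example-crash-signature')  # in the simple case, the Crash Signature
hooks, token = parse_hook_response(payload)
```

The client echoes the token back with its collected data, so the server never has to trust the client to name the bucket directly.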

Release specific hooks?

Restrict to a particular release upfront?

No, package hooks should check the DistroRelease field, when needed. Maybe DistroRelease checking should be in the web UI template for the hooks (if DistroRelease == ‘Ubuntu 13.04’:)
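A DistroRelease guard at the top of a hook might look like the sketch below. Apport hooks conventionally define `add_info(report)`; the report is shown as a plain dict, and the `ExtraDebugInfo` field is a hypothetical example:

```python
# Sketch of a release-specific guard in a server-side hook. The report
# behaves like a dict of apport fields; ExtraDebugInfo is hypothetical.
def add_info(report):
    # Only collect on the release the developer cares about.
    if report.get('DistroRelease') != 'Ubuntu 13.04':
        return
    # Hypothetical extra data collection would go here.
    report['ExtraDebugInfo'] = 'collected'

report = {'DistroRelease': 'Ubuntu 13.04'}
add_info(report)
```

If the web UI template pre-fills this check, hook authors only have to edit the release string rather than remember the field name.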

Compression

If there is time in the initial implementation, we should evaluate using compression for the increasingly large amount of data transferred between whoopsie and daisy. Candidates for this are snappy and zlib. xz could also be considered if we keep decompression time low on the server.

We should work with the webops team on this, as they may have strong opinions on the implementation. We will only get a few instances of this extra information out of the 100,000 reports we receive a day, given the limits we're putting on size, so this shouldn’t overload us.
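Two of the three candidates can be compared directly from the Python standard library; snappy would need a third-party binding, so it is omitted from this quick sketch:

```python
# Quick size comparison of two compression candidates. zlib and lzma
# (the xz format) ship with Python; snappy needs a third-party binding.
# The sample imitates repetitive report data such as a ProcMaps field.
import zlib
import lzma

sample = b'ProcMaps: ' + b'7f3a2c000000-7f3a2c021000 rw-p\n' * 1000

zlib_size = len(zlib.compress(sample))
xz_size = len(lzma.compress(sample))
```

zlib typically compresses and decompresses fastest; xz trades server-side CPU for a smaller payload, which is why decompression time is the deciding factor.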

Sending

Whoopsie gets a few hooks, downloads them, and runs them with the python-apport code to update the report. It then sends back only the new keys in the report (not any existing or modified keys), and the report gets written to a new CF.
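The "new keys only" rule can be sketched as a simple dict difference; the field names below are illustrative:

```python
# Sketch of sending back only the keys a hook added. Keys that existed
# before the hook ran are excluded, even if the hook modified their
# values, matching the rule above. Field names are illustrative.
def new_keys_only(original, updated):
    """Return just the fields that did not exist before the hook ran."""
    return {k: v for k, v in updated.items() if k not in original}

original = {'Signal': '11', 'ExecutablePath': '/usr/bin/foo'}
updated = dict(original, ProcMaps='...', Signal='6')  # one added, one modified
payload = new_keys_only(original, updated)
```

Only `ProcMaps` would be uploaded here; the modified `Signal` value is dropped, so the server never overwrites data it already holds.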

Static analysis / error handling and sending that back

Include the exception from the hook as a new field. Send that to daisy. Daisy uses the presence of this field to indicate a failed hook.

Failed hooks should send an alert to the hook creator via SMS (mup).
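Capturing the exception as a field might look like the sketch below; the `HookError` field name is an assumption, but the mechanism (presence of the field signals failure) is as described above:

```python
# Sketch of capturing a hook failure as a report field. Daisy would use
# the presence of this field (name assumed) to flag a failed hook and
# trigger the alert to the hook creator.
import traceback

def run_hook(hook, report):
    try:
        hook(report)
    except Exception:
        # Attach the full traceback so the developer can debug the hook.
        report['HookError'] = traceback.format_exc()

def broken_hook(report):
    raise ValueError('could not read /var/log/foo')

report = {}
run_hook(broken_hook, report)
```

Wrapping the hook on the client keeps one broken hook from aborting the whole collection run.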

Audit

A report of active hooks would be useful: a page listing the hooks running now, with their expiry dates, the number of reports each has received, and the size of the data that's been transmitted to us from those hooks.

We should record by day hook usage statistics (active, inactive, working, failed) to determine whether or not the hook mechanism is actually being used and working.

Test mechanism

A script to make testing a new hook against a local system or a canonistack system easy.