Owner: Marc Deslauriers
This template is used to track events during a crisis or potential crisis. The goal is not to analyse the entire event, but rather to provide whiteboard-style communications with the key people involved in the reaction plan. If you are not directly involved, do not speculate on pages of this type.
A PAM security update on 2011-05-30 caused cron to stop working. Affected releases are 8.04 LTS, 10.04 LTS, 10.10 and 11.04.
Crisis Response Team
- Marc Deslauriers (security)
- Chris Jones (IS)
- Colin Watson (archive)
- Lamont Jones (soyuz)
2011-05-30: USN-1140-1 was published. Security team's tools identified added library symbols.
2011-05-31 02:49: Canonical support opens customer case #18265 regarding a problem with libpam0g and cron
2011-05-31 09:10: Marc Deslauriers notices bug, reproduces successfully and starts working on a fix
2011-05-31 09:44: Marc Deslauriers follows up to the bug, giving a workaround for affected users (restart cron) and saying that he is working on a fix.
2011-05-31 12:02: Marc realizes severity of issue and asks for downloads of the broken security update to be blocked on the archive. Did not receive an immediate response from #is, so Marc called the emergency number.
2011-05-31 12:17: Marc uploads fixes to the security PPA for Lucid - Natty. Hardy would not compile. Hands off Hardy diagnosis to Colin Watson and begins testing Lucid - Natty.
2011-05-31 12:33: Colin Watson removes the affected package versions, and starts a manual publisher run.
2011-05-31 12:34: Chris Jones blocks internal archive/ports machines from updating from cocoplum, removes the .debs for those versions, and pushes all available archive mirrors to remove those packages.
2011-05-31 12:37: Colin posts an ubuntustatus notice and starts to diagnose Hardy compile problem.
2011-05-31 13:03: Marc contacts Matt Zimmerman via IRC as point of escalation (Jamie S. on vacation, Rick S. not online yet)
2011-05-31 13:09: Publisher run (see 12:33) finishes.
2011-05-31 13:38: Matt receives Marc's message and reads the incident report
2011-05-31 13:41: Matt and Marc have a voice call to sync up, confirm that Matt understands status and plan
2011-05-31 13:49: Matt contacts Martin S. via IRC to engage the support team
2011-05-31 14:55: Jamie S comes online, brought up to speed (everything in hand at this point)
2011-05-31 15:11: After discovering the compile problem on Hardy, Marc uploads fixes to security PPA for Hardy
2011-05-31 15:25: Jamie S helps with USN text
2011-05-31 16:10: Marc lets IS know that Hardy amd64 is not building (no builders)
2011-05-31 16:21: Marc asks #soyuz to kill a hung builder
2011-05-31 16:33: Lamont kills the hung build
2011-05-31 16:35: Hardy amd64 starts to build
2011-05-31 16:47: Marc finishes testing updated packages and releases them
2011-05-31 17:37: Publisher finished publishing binaries to the archive
2011-05-31 17:40: Marc publishes USN-1140-2. The issue is now resolved.
2011-05-31 17:46: Colin posts an updated ubuntustatus notice.
All times are in UTC.
<Build a chronological list of events as they unfold.>
<Identify positive things that happened. What went right in the course of our response?>
- escalation procedures were prompt and properly followed
- freezing the archive rapidly and smoothly
- patches/testing for Lucid - Natty went smoothly and rapidly
- provided workarounds in the bug
- pointed people to the bug in mailing lists, irc, and social networking (eg, identica/ubuntustatus)
- creating an update that would fix the issue both for people who restarted daemons, and for people who didn't
- once resolved, communicating the issue promptly to ubuntu-security-announce, ubuntu planet, twitter, identica, etc
<Identify problems with the events. What went wrong in the course of our response?>
- Security team's tools identified added symbols but did not warn this would be a problem with cron. While the added symbols were considered, nothing in testing indicated this would be a problem with the way certain running daemons interacted with pam (eg, GDM and ssh worked properly, but cron was not tested).
- QA of the updated pam packages before publication did not notice the cron daemon had stopped working
- Automatic updates rely heavily on cron (single point of failure)
- Certain pam updates need service restarts. A related blueprint dealing with distribution upgrades could fix this issue.
- It took almost 1.5 hours for Hardy to start building (lack of builders)
- Canonical support did not alert the security team to the problem
- Not all the right people are alerted via '!regression-alert' in #ubuntu-devel (currently cjwatson, jdong, pitti, slangasek, ScottK, mdz, kees, ttx, marjo, seb128)
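The first problem above (added symbols were detected but no warning was raised) can be illustrated with a minimal sketch of a symbol-set comparison. This is not the security team's actual tooling. The function name, the mapping of changes to FAIL/WARN/INFO, and the second symbol name in the usage note are all assumptions made for illustration.

```python
def classify_symbol_change(old_symbols, new_symbols):
    """Compare exported symbols of two library versions.

    Hypothetical mapping, assumed for illustration only:
      FAIL - symbols were removed
      WARN - symbols were added (services that loaded the old
             library may need a restart)
      INFO - no change in the exported symbol set
    Returns (level, added, removed) with sorted symbol lists.
    """
    old = set(old_symbols)
    new = set(new_symbols)
    removed = sorted(old - new)
    added = sorted(new - old)
    if removed:
        level = "FAIL"
    elif added:
        level = "WARN"
    else:
        level = "INFO"
    return level, added, removed
```

For example, calling classify_symbol_change(["pam_start"], ["pam_start", "pam_example_new"]) reports the added (hypothetical) symbol at WARN level, which a reviewer could use as a prompt to test long-running daemons such as cron.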
<Suggest changes to process to minimize problems in the future. These should correspond to the problems identified above.>
DONE: Update QRT documentation about pam ABI changes (pointing at /var/lib/dpkg/info/libpam0g:$arch.postinst for a potential list of problematic packages)
DONE: Update security team tools to FAIL/WARN/INFO as needed when a pam ABI change is detected (tools already check ABI, just need to add some pam specific logic)
DONE: QRT test-pam.py should verify cron is still working
INPROGRESS: Marc to follow up with Michael Vogt on how to improve automatic update handling to have a fallback in case of cron failure.
- Should cron restart itself?
- Should unattended-upgrades better detect this situation (eg, checking timestamps or other)?
- Should cron error handler call an emergency script which tells unattended updates to see if an update is available?
DONE: Investigate ways to increase the number of builders available for high priority updates (filed 795037)
DONE: Jamie to follow up with Canonical support so that Ubuntu Platform gets notifications sooner
DONE: Jamie to follow up in #ubuntu-ops on adjusting regression-alert. This has been changed to 'cjwatson, jdong, pitti, skaet, ScottK, kees, Daviey, pgraner' (tech leads for foundations, desktop, server, and security, the release manager, the QA manager (in lieu of tech lead) and key community members)
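The timestamp-checking idea raised in the open questions above could look roughly like the following sketch. It is not how unattended-upgrades works today. The function name, the stamp-file approach, and the threshold are assumptions, and something other than cron would have to run the check for it to be a useful fallback.

```python
import os
import time


def is_stale(stamp_path, max_age_seconds, now=None):
    """Return True if stamp_path is missing or was last modified
    more than max_age_seconds ago.

    stamp_path is a hypothetical file that a successful update run
    would touch; the caller chooses the path and the threshold.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(stamp_path)
    except OSError:
        # A missing or unreadable stamp is treated as stale.
        return True
    return (now - mtime) > max_age_seconds
```

A caller would pass whichever stamp file its update job touches and a threshold somewhat longer than the expected update interval, so that a single late run does not trigger the fallback.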