2010-04-12-Lucid-Kernel

Incident Description

http://launchpad.net/bugs/561151

Crisis Response Team

  • Andy Whitcroft
  • Stefan Bader
  • Jane Silber
  • Matt Zimmerman
  • Chris Jones
  • Zaid Al Hamami
  • Colin Watson
  • Robbie Williamson

Events

All times are in UTC.

2010-04-12

03:xx

Master bug 561151 is filed

07:36
dholbach brings the issue to the attention of the #ubuntu-kernel team; Sarvatt helps diagnose, showing that upstream kernels are not affected
08:27
kernel-team becomes aware of the issue and assists with ongoing analysis; it appears to affect only specific systems
10:00
analysis suggests this may be related to an EC (ACPI embedded controller) change. As upstream has this change too and testing has shown upstream unaffected, we suspect a conflict between this new change and an Ubuntu-specific change to improve boot performance. Preparation and builds of test kernels with the Ubuntu changes reverted have started.
12:58
Jane notifies Matt of the problem, naming Andy as a point of contact. Pete Graner (kernel team manager) and Robbie Williamson (Pete's manager) are both unavailable due to time zones and travel.
13:00
a number of new cases are reported, escalating the issue
13:00
testing of the kernels with the Ubuntu-specific changes reverted shows similar issues
13:00

Matt takes managerial responsibility for the issue, reads #ubuntu-kernel to see what's happening

13:08
Matt and Andy speak by phone to assess the severity of the issue, and agree that blocking the package is appropriate
13:13
Matt makes contact with Chris Jones, asks him to stand by
13:17

Matt provides Chris with the list of filenames to be blocked, per Andy:

  linux-image-2.6.32-20-generic_2.6.32-20.29_amd64.deb (29.5 MiB)
  linux-image-2.6.32-20-preempt_2.6.32-20.29_amd64.deb (29.9 MiB)
  linux-image-2.6.32-20-server_2.6.32-20.29_amd64.deb (29.5 MiB)
  linux-image-2.6.32-20-386_2.6.32-20.29_i386.deb (29.5 MiB)
  linux-image-2.6.32-20-generic_2.6.32-20.29_i386.deb (29.5 MiB)
  linux-image-2.6.32-20-generic-pae_2.6.32-20.29_i386.deb (29.6 MiB)
13:21
Chris confirms that the package "has been blocked on ftpmaster.internal and removed from our archive servers"
13:22

Matt enters #canonical-support and notifies the Canonical support team of the issue:

<mdz> hello all
 I'd like to make you aware of a serious-looking regression in lucid which is likely to affect Canonical staff
<zaid_h> mdz - please go ahead
<mdz> zaid_h, the kernel package version 2.6.32-20.29 includes a regression which is known to affect some/many ThinkPads
 IS has already blocked further downloads of the package
 the kernel team is working on a fix
 anyone who has installed that version already could potentially find that their system won't boot
 in which case they will need to select an older kernel using GRUB
<zaid_h> mdz - okie doke. Think pads mainly? Is there a bug # we could follow for updates?
<zaid_h> MagicFab, pmatulis, EtienneG:^^
<mdz> zaid_h, I'm gathering those details now and creating an incident report at https://wiki.canonical.com/IncidentReports/2010-04-12-Lucid-Kernel
<EtienneG> understood, thanks mdz
<zaid_h> mdz, thx
13:26
Matt creates this incident report page
13:27

Matt polls #ubuntu-devel for an archive administrator to remove the package, per UbuntuPlatform/DealingWithCrisis

13:28
Scott Kitterman responds, but does not know whether he has the necessary privileges to run lp-remove-package.py
13:30
Andy: test kernels carrying the reversion of the suspect EC change prepared
13:32
Colin Watson responds
13:51
50+ confirmed cases via Launchpad reports
13:52

Colin removes the affected packages:

lp_archive@cocoplum:~/syncs$ lp-remove-package.py -u cjwatson -m 'temporarily remove due to bug 561151 affecting ThinkPad users' -b linux-image-2.6.32-20-386 linux-image-2.6.32-20-generic-pae linux-image-2.6.32-20-generic linux-image-2.6.32-20-preempt linux-image-2.6.32-20-server
2010-04-12 13:51:54 INFO    creating lockfile
2010-04-12 13:52:00 INFO    Removing candidates:
2010-04-12 13:52:00 INFO        linux-image-2.6.32-20-386 2.6.32-20.29 in lucid i386
2010-04-12 13:52:00 INFO        linux-image-2.6.32-20-generic-pae 2.6.32-20.29 in lucid i386
2010-04-12 13:52:00 INFO        linux-image-2.6.32-20-generic 2.6.32-20.29 in lucid amd64
2010-04-12 13:52:00 INFO        linux-image-2.6.32-20-generic 2.6.32-20.29 in lucid i386
2010-04-12 13:52:00 INFO        linux-image-2.6.32-20-preempt 2.6.32-20.29 in lucid amd64
2010-04-12 13:52:00 INFO        linux-image-2.6.32-20-server 2.6.32-20.29 in lucid amd64
2010-04-12 13:52:00 INFO    Removed-by: Colin Watson
2010-04-12 13:52:00 INFO    Comment: temporarily remove due to bug 561151 affecting ThinkPad users
2010-04-12 13:52:00 INFO    6 packages successfully removed.
Confirm this transaction? [yes, no] yes
2010-04-12 13:52:14 INFO    Transaction committed.
2010-04-12 13:52:14 INFO    The archive will be updated in the next publishing cycle.
14:00

testing confirms EC change as the culprit:

 * (pre-stable) ACPI: EC: Allow multibyte access to EC
     - LP: #526354
14:00
Robbie Williamson comes online, and Matt notifies him of the incident in progress
14:08
Matt hands off responsibility to Robbie
14:22
Daniel Holbach confirms the test kernel (2.6.32-20-generic #30~lp561151v201004121418) resolves the issue
14:35
Andy has 3 confirmations and estimates that updated binaries will be in the archive in 4 hours (18:30)
14:36

Andy will continue to investigate the reason for the bug, as the fix being released will cause some Dell machines to regress to http://bugs.launchpad.net/bugs/526354

14:45
apw: additional testing shows 6/6 ThinkPads resolved; this also seems to fix 2/2 MacBooks
15:15
apw: uploaded updated kernel 2.6.32-20.30, builds started for i386 and amd64
21:40

Packages for i386/amd64 complete building.

23:01
Packages are pushed out onto main archive machines and top-tier mirrors.

Tue, 13 Apr 2010 00:01:03 +0100: External archive mirror triggers completed.

Successes

  • Early and direct reporting of the issues to the kernel team by affected employees got resolution of the issue started several hours sooner than it would otherwise have been. If you are affected by a problem and wonder if you should tell someone, tell someone. (AndyWhitcroft)

Problems

  • We were slow to realise this was a generic issue affecting all ThinkPads, leading to the affected kernel being available for much longer than it should have been. (AndyWhitcroft)

  • The particular response chosen represented one side of a trade-off between minimising the number of affected users and enabling developers to work effectively. Once the kernel was made non-downloadable, all network-only upgrades and netboot installation tests became impossible, which may be a serious problem for the relevant developers in the crunch time immediately before FinalFreeze. (ColinWatson)

  • Because of changes to preserve a certain boot experience, some users had a hard time booting into older, installed kernels. (RobbieWilliamson)

Recommendations

  • Better testing of kernel bug fixes post KernelFreeze (RobbieWilliamson)

  • I think the decision to block the download was the right one. With that said, the decision to block the download of packages should always take into account where we are in the release cycle, as it could potentially cause more harm than good. (RobbieWilliamson)

  • Investigate how we can provide access to the boot loader, while preserving the overall boot experience for the user (RobbieWilliamson)

IncidentReports/2010-04-12-Lucid-Kernel (last edited 2010-04-28 09:53:34 by robbie.w)