=== Incident Description ===

http://launchpad.net/bugs/561151

=== Crisis Response Team ===

 * Andy Whitcroft
 * Stefan Bader
 * Jane Silber
 * Matt Zimmerman
 * Chris Jones
 * Zaid Al Hamami
 * Colin Watson
 * Robbie Williamson

=== Events ===

All times are in UTC.

==== 2010-04-12 ====

 03:xx:: Master bug [[http://launchpad.net/bugs/561151|bug 561151]] is filed
 07:36:: dholbach brings the issue to the #ubuntu-kernel team's attention; Sarvatt helps diagnose, showing that upstream kernels are not affected
 08:27:: kernel-team becomes aware of the issue and assists with ongoing analysis; it appears to be affecting specific systems
 10:00:: analysis suggests this may be related to an EC change. As upstream has this change too and testing has shown upstream unaffected, we suspect a conflict between this new change and an Ubuntu-specific change to improve boot performance. Preparation and builds of test kernels with reversion of Ubuntu changes started.
 12:58:: Jane notifies Matt of the problem, naming Andy as a point of contact. Pete Graner (kernel team manager) and Robbie Williamson (Pete's manager) are both unavailable due to time zones and travel.
 13:00:: a number of new cases of the issue are reported, escalating the issue
 13:00:: testing of the kernels with the Ubuntu changes reverted shows similar issues
 13:00:: Matt takes managerial responsibility for the issue, reads `#ubuntu-kernel` to see what's happening
 13:08:: Matt and Andy speak by phone to assess the severity of the issue, agree that blocking the package is appropriate
 13:13:: Matt makes contact with Chris Jones, asks him to stand by
 13:17:: Matt provides Chris with the list of filenames to be blocked, per Andy:
 {{{
linux-image-2.6.32-20-generic_2.6.32-20.29_amd64.deb (29.5 MiB)
linux-image-2.6.32-20-preempt_2.6.32-20.29_amd64.deb (29.9 MiB)
linux-image-2.6.32-20-server_2.6.32-20.29_amd64.deb (29.5 MiB)
linux-image-2.6.32-20-386_2.6.32-20.29_i386.deb (29.5 MiB)
linux-image-2.6.32-20-generic_2.6.32-20.29_i386.deb (29.5 MiB)
linux-image-2.6.32-20-generic-pae_2.6.32-20.29_i386.deb (29.6 MiB)
 }}}
 13:21:: Chris confirms that the package "has been blocked on ftpmaster.internal and removed from our archive servers"
 13:22:: Matt enters `#canonical-support` and notifies the Canonical support team of the issue:
 {{{
hello all
I'd like to make you aware of a serious-looking regression in lucid which is likely to affect Canonical staff
mdz - please go ahead
zaid_h, the kernel package version 2.6.32-20.29 includes a regression which is known to affect some/many ThinkPads
IS has already blocked further downloads of the package
the kernel team is working on a fix
anyone who has installed that version already could potentially find that their system won't boot
in which case they will need to select an older kernel using GRUB
mdz - okie doke. Think pads mainly? Is there a bug # we could follow for updates?
MagicFab, pmatulis, EtienneG:^^
zaid_h, I'm gathering those details now and creating an incident report at https://wiki.canonical.com/IncidentReports/2010-04-12-Lucid-Kernel
understood, thanks mdz
mdz, thx
 }}}
 13:26:: Matt creates this incident report page
 13:27:: Matt polls `#ubuntu-devel` for an archive administrator to remove the package, per [[UbuntuPlatform/DealingWithCrisis]]
 13:28:: Scott Kitterman responds, but does not know whether he has the necessary privileges to run lp-remove-package.py
 13:30:: Andy: test kernels with the suspect EC change reverted are prepared
 13:32:: Colin Watson responds
 13:51:: 50+ confirmed cases via Launchpad reports
 13:52:: Colin removes the affected packages:
 {{{
lp_archive@cocoplum:~/syncs$ lp-remove-package.py -u cjwatson -m 'temporarily remove due to bug 561151 affecting ThinkPad users' -b linux-image-2.6.32-20-386 linux-image-2.6.32-20-generic-pae linux-image-2.6.32-20-generic linux-image-2.6.32-20-preempt linux-image-2.6.32-20-server
2010-04-12 13:51:54 INFO creating lockfile
2010-04-12 13:52:00 INFO Removing candidates:
2010-04-12 13:52:00 INFO linux-image-2.6.32-20-386 2.6.32-20.29 in lucid i386
2010-04-12 13:52:00 INFO linux-image-2.6.32-20-generic-pae 2.6.32-20.29 in lucid i386
2010-04-12 13:52:00 INFO linux-image-2.6.32-20-generic 2.6.32-20.29 in lucid amd64
2010-04-12 13:52:00 INFO linux-image-2.6.32-20-generic 2.6.32-20.29 in lucid i386
2010-04-12 13:52:00 INFO linux-image-2.6.32-20-preempt 2.6.32-20.29 in lucid amd64
2010-04-12 13:52:00 INFO linux-image-2.6.32-20-server 2.6.32-20.29 in lucid amd64
2010-04-12 13:52:00 INFO Removed-by: Colin Watson
2010-04-12 13:52:00 INFO Comment: temporarily remove due to bug 561151 affecting ThinkPad users
2010-04-12 13:52:00 INFO 6 packages successfully removed.
Confirm this transaction? [yes, no] yes
2010-04-12 13:52:14 INFO Transaction committed.
2010-04-12 13:52:14 INFO The archive will be updated in the next publishing cycle.
 }}}
 14:00:: testing confirms EC change as the culprit:
 {{{
  * (pre-stable) ACPI: EC: Allow multibyte access to EC
    - LP: #526354
 }}}
 14:00:: Robbie Williamson comes online, and Matt notifies him of the incident in progress
 14:08:: Matt hands off responsibility to Robbie
 14:22:: Daniel Holbach confirms the test kernel (2.6.32-20-generic #30~lp561151v201004121418) resolves the issue
 14:35:: Andy has 3 confirmations and estimates updated binaries in the archive in 4 hours (18:30)
 14:36:: Andy will continue to investigate the reason for the bug, as the fix released will revert some Dell machines back into http://bugs.launchpad.net/bugs/526354
 14:45:: apw: additional testing shows 6/6 ThinkPads resolved; this also seems to fix 2/2 MacBooks
 15:15:: apw: uploaded updated kernel 2.6.32-20.30, builds started for i386 and amd64
 21:40:: Packages for [[https://launchpad.net/ubuntu/+source/linux/2.6.32-20.30/+build/1687719|i386]]/[[https://launchpad.net/ubuntu/+source/linux/2.6.32-20.30/+build/1687717|amd64]] complete building.
 23:01:: Packages are pushed out onto main archive machines and top-tier mirrors.
 {{{
Tue, 13 Apr 2010 00:01:03 +0100: External archive mirror triggers completed.
 }}}

=== Successes ===

 * Early and direct reporting of the issues to the kernel team by affected employees got resolution of the issue started several hours sooner than it would otherwise have been. If you are affected by a problem and wonder if you should tell someone, tell someone. (AndyWhitcroft)

=== Problems ===

 * We were slow to realise this was a generic issue affecting all ThinkPads, leading to the affected kernel being available for much longer than it should have been. (AndyWhitcroft)
 * The particular response chosen represented one side of a trade-off between minimising the number of affected users and enabling developers to work effectively.
   Once the kernel was made non-downloadable, all network-only upgrades and netboot installation tests became impossible, which may be a serious problem for the relevant developers in the crunch time immediately before FinalFreeze. (ColinWatson)
 * Because of changes to preserve a certain boot experience, some users had a hard time booting into older, installed kernels. (RobbieWilliamson)

=== Recommendations ===

 * Better testing of kernel bug fixes '''post''' KernelFreeze (RobbieWilliamson)
 * I think the decision to block the download was the right one. That said, any decision to block the download of packages should always take into account where we are in the release cycle, as it could potentially cause more harm than good. (RobbieWilliamson)
 * Investigate how we can provide access to the boot loader, while preserving the overall boot experience for the user (RobbieWilliamson)