2012-07-19-oneiric-kernel-regression-kills-EC2-instances

Differences between revisions 1 and 23 (spanning 22 versions)
Revision 1 as of 2011-09-21 03:44:27
Size: 3895
Editor: vorlon
Comment:
Revision 23 as of 2012-07-19 18:06:33
Size: 1793
Editor: 74-95-45-185-Oregon
Comment:
Line 3: Line 3:
Owner: Steve Langasek Owner: ?
Line 7: Line 7:
In preparation for the Oneiric Beta 2 milestone, an update of the ca-certificates package to correct a bug preventing verification of SSL certificates via openssl on new installs caused /etc/ca-certificates/update.d to be run against existing oneiric systems on upgrade. When the ca-certificates-java package is installed as a dependency of openjdk, a hook script in this directory mistakenly removed the libnss3.so system directory, breaking many applications for affected users - including Network Manager, with the result that many desktop users would be unable to access the network after reboot.

The Linux kernel 3.0.0-23.38 update appears to have exposed a regression that kills EC2 instances. Any user who applies the update is left with a dead EC2 instance. Instance-store users will lose all data. EBS users will have a very painful recovery path.

This issue is unique to Oneiric kernels.
Line 10: Line 12:
 * Steve Langasek
 * Adam Conrad
 * Matthias Klose
 * Jorge Castro
 * Marc Deslauriers
 * Christopher James Halse Rogers
 * Brad Marshall
 * Ben Howard (utlemming)
 * Brad Figg (bjf)
 * Stefan Bader (smb)
 * Kate Stewart (skaet)
 * Adam Conrad (infinity)
 * Antonio Rosales (arosales)
 * Carlos de Avillez (hggdh)
Line 22: Line 24:
 * 2011-09-15 20:03 - A new version of the openssl package is uploaded to oneiric that includes a fix for [[http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=628780]]. This fix introduces a regression in the behavior of c_rehash which is not noticed immediately because c_rehash is only run for /etc/ssl/certs on initial installation or upgrade of the ca-certificates package.
 * 2011-09-18 12:04 - A new version of ca-certificates-java is accepted by the release team via the milestone freeze queue to correct an incompatibility with the multiarch-enabled version of libnss3. This update introduces a bug in /etc/ca-certificates/update.d/jks-keystore which goes unnoticed because this script is only run when update-ca-certificates is run.
 * 2011-09-20 18:27 - Scott Moser escalates the c_rehash regression to the release team, after it is discovered in beta-2 preparation on the Ubuntu cloud images.
 * 2011-09-20 20:49 - Steve Langasek isolates the bug to a behavior change in c_rehash
 * 2011-09-20 21:02 - Steve submits a prospective fix to ca-certificates to work around the c_rehash behavior change for testing on #ubuntu-release. The update-ca-certificates command output shows the errors from the ca-certificates-java hook, but no follow-up is done at the time.
 * 2011-09-20 21:55 - Steve uploads ca-certificates 20110502+nmu1ubuntu2 which includes a call to update-ca-certificates --fresh when upgrading.
 * 2011-09-20 20:02 - Adam Conrad reviews and accepts the ca-certificates upload.
 * 2011-09-21 00:00 - Bug:855171 is filed reporting that libnss3.so has disappeared on upgrade.
 * 2011-09-21 00:13 - Marc Deslauriers escalates Bug:855171 to #ubuntu-release
 * 2011-09-21 00:24 - Steve isolates the cause to the update-ca-certificates call and the broken ca-certificates-java hook.
 * 2011-09-21 00:25 - Christopher James Halse Rogers begins to debug the issue on #ubuntu-devel.
 * 2011-09-21 00:33 - Matthias Klose (ca-certificates-java maintainer) becomes aware of the issue and begins working with Chris et al. on #ubuntu-devel to resolve the bug.
 * 2011-09-21 00:35 - Jorge Castro posts an [[http://identi.ca/notice/84195934|ubuntustatus notice]] and works on notifying users via the Ubuntu Forums.
 * 2011-09-21 00:59 - Brad Marshall disables downloads of ca-certificates on the mirrors.
 * 2011-09-21 01:00 - Steve and Matthias prepare a fix for ca-certificates-java that corrects the library-removing bug
 * 2011-09-21 01:06 - Steve uploads the fixed ca-certificates-java package
 * 2011-09-21 01:18 - Adam reviews and accepts the ca-certificates-java package
 * 2011-09-21 01:30 - After further discussion on #ubuntu-release with Adam, Steve uploads a new version of ca-certificates that Conflicts with the old ca-certificates-java to help further limit the damage
 * 2011-09-21 01:35 - Adam reviews and accepts the ca-certificates package
 * 2011-09-21 02:02 - Fixed packages are published.
 * 2011-09-21 02:19 - Brad re-enables mirroring.
 * 2012-07-19 16:24 - https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/1026690 opened
 * 2012-07-19 16:25 - On #ubuntu-release, Ben Howard reports a critical kernel regression and asks how to back out the bad SRU.
 * 2012-07-19 16:31 - Kate Stewart points Ben to Brad Figg to work out the best path forward.
 * 2012-07-19 16:33 - In #ubuntu-kernel, Ben, Brad, and Stefan begin discussing the scope of impact.
 * 2012-07-19 16:50 - Kate starts the incident report.
 * 2012-07-19 16:57 - Adam removes linux-image-3.0.0-23-virtual 3.0.0-23.38 from oneiric i386. Removal comment: Broken SRU; blows up EC2 instances
 * 2012-07-19 16:58 - Kernel team has verified that this issue is unique to Oneiric. The bad commit exists in other kernels; however, the commit that fixes the issue also exists in those kernels.
 * 2012-07-19 17:30 - QA identified that the EC2 i386 test was testing the wrong architecture.
 * 2012-07-19 17:44 - Kernel team believes it has identified the problem commit.
 * 2012-07-19 17:44 - Kernel team is building test kernels and preparing for a quick upload.

=== Successes ===
 *
=== Problems ===
 *
=== Recommendations ===
 *

Owner: ?

Incident Description

The Linux kernel 3.0.0-23.38 update appears to have exposed a regression that kills EC2 instances. Any user who applies the update is left with a dead EC2 instance. Instance-store users will lose all data. EBS users will have a very painful recovery path.

This issue is unique to Oneiric kernels.
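
As a rough illustration of how an instance could be checked against the affected update, the sketch below parses an Ubuntu kernel version signature and compares it with the version and flavour named in this report. It assumes the signature has the form "Ubuntu <package version>-<flavour> <upstream version>", as exposed by /proc/version_signature on Ubuntu kernels. The function names and the sample strings are hypothetical, and treating only the virtual flavour as affected is an assumption drawn from the package removal recorded in the events, not a statement of the full scope.

```python
# Version and flavour named in this incident report.
BAD_VERSION = "3.0.0-23.38"
BAD_FLAVOUR = "virtual"  # assumption: only the removed flavour is checked


def parse_signature(signature):
    """Split a signature such as 'Ubuntu 3.0.0-23.38-virtual <upstream>'
    into (package_version, flavour). The upstream field is ignored."""
    fields = signature.split()
    if len(fields) < 2:
        raise ValueError("unexpected signature: %r" % (signature,))
    # The flavour is the text after the last '-' of the second field.
    version, sep, flavour = fields[1].rpartition("-")
    if not sep or not version:
        raise ValueError("unexpected signature: %r" % (signature,))
    return version, flavour


def is_affected(signature):
    """Return True if the signature matches the kernel named in this report."""
    return parse_signature(signature) == (BAD_VERSION, BAD_FLAVOUR)
```

On a running instance the input could be read with open("/proc/version_signature").read(); the check is only as good as the assumption about the signature format.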

Crisis Response Team

  • Ben Howard (utlemming)
  • Brad Figg (bjf)
  • Stefan Bader (smb)
  • Kate Stewart (skaet)
  • Adam Conrad (infinity)
  • Antonio Rosales (arosales)
  • Carlos de Avillez (hggdh)

Events

All times are in UTC.

  • 2012-07-19 16:24 - https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/1026690 opened

  • 2012-07-19 16:25 - On #ubuntu-release, Ben Howard reports a critical kernel regression and asks how to back out the bad SRU.
  • 2012-07-19 16:31 - Kate Stewart points Ben to Brad Figg to work out the best path forward.
  • 2012-07-19 16:33 - In #ubuntu-kernel, Ben, Brad, and Stefan begin discussing the scope of impact.
  • 2012-07-19 16:50 - Kate starts the incident report.
  • 2012-07-19 16:57 - Adam removes linux-image-3.0.0-23-virtual 3.0.0-23.38 from oneiric i386. Removal comment: Broken SRU; blows up EC2 instances
  • 2012-07-19 16:58 - Kernel team has verified that this issue is unique to Oneiric. The bad commit exists in other kernels; however, the commit that fixes the issue also exists in those kernels.
  • 2012-07-19 17:30 - QA identified that the EC2 i386 test was testing the wrong architecture.
  • 2012-07-19 17:44 - Kernel team believes it has identified the problem commit.
  • 2012-07-19 17:44 - Kernel team is building test kernels and preparing for a quick upload.
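
The 17:30 finding suggests one possible safeguard: have a test confirm the architecture of the instance it booted before reporting a result. The sketch below is a minimal illustration of that idea, not the QA team's actual harness. The mapping and the function name are hypothetical; it relies on platform.machine(), which reports the kernel's machine type, so a 32-bit userland on a 64-bit kernel would need a different check (for example the output of dpkg --print-architecture).

```python
import platform

# Hypothetical mapping from Ubuntu architecture names to the machine
# strings a booted kernel of that architecture may report.
MACHINE_NAMES = {
    "i386": {"i386", "i486", "i586", "i686"},
    "amd64": {"x86_64", "amd64"},
}


def arch_matches(expected_arch, machine=None):
    """Return True if the machine string belongs to the expected architecture.

    When machine is None, the running kernel's machine type is used.
    """
    if expected_arch not in MACHINE_NAMES:
        raise ValueError("unknown architecture: %r" % (expected_arch,))
    if machine is None:
        machine = platform.machine()
    return machine.lower() in MACHINE_NAMES[expected_arch]
```

A test labelled i386 could call arch_matches("i386") at startup and abort if it returns False, so that a run on the wrong architecture is not recorded as a pass.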

Successes

Problems

Recommendations

IncidentReports/2012-07-19-oneiric-kernel-regression-kills-EC2-instances (last edited 2012-07-21 06:37:13 by adconrad)