2012-07-19-oneiric-kernel-regression-kills-EC2-instances

Owner: Pete Graner

Incident Description

The Linux kernel 3.0.0-23.38 update appears to have exposed a regression that kills EC2 instances. Any user who updates is left with a dead EC2 instance. Instance-store users will lose all data. EBS users will have a very painful recovery path.

This issue is unique to Oneiric kernels.

Crisis Response Team

  • Ben Howard (utlemming)
  • Brad Figg (bjf)
  • Stefan Bader (smb)
  • Kate Stewart (skaet)
  • Adam Conrad (infinity)
  • Leann Ogasawara (ogasawara)
  • Antonio Rosales (arosales)
  • Carlos de Avillez (hggdh)

Events

All times are in UTC.

  • 2012-07-19 16:24 - https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/1026690 opened

  • 2012-07-19 16:25 - on #ubuntu-release Ben Howard comments on critical kernel regression and asks how to back out bad SRU.
  • 2012-07-19 16:31 - Kate Stewart points Ben to Brad Figg to work out best path forward.
  • 2012-07-19 16:33 - in #ubuntu-kernel, Ben, Brad and Stefan start discussing the scope of impact.
  • 2012-07-19 16:50 - Kate starts off incident report.
  • 2012-07-19 16:57 - Adam removes packages: linux-image-3.0.0-23-virtual 3.0.0-23.38 in oneiric i386. Comment: Broken SRU; blows up EC2 instances
  • 2012-07-19 16:58 - Kernel team has verified that this issue is unique to Oneiric. The bad commit exists in other kernels; however, the commit that fixes the issue also exists in those kernels.
  • 2012-07-19 17:07 - Adam volunteers to help fast track fix through SRU process.
  • 2012-07-19 17:30 - QA identified that the EC2 i386 test was testing the wrong architecture.
  • 2012-07-19 17:44 - Kernel team believes it has identified the problem commit.
  • 2012-07-19 17:44 - Kernel team building test kernels and preparing for a quick upload.
  • 2012-07-19 17:56 - Tracking bug created: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1026730

  • 2012-07-19 18:11 - Test kernel handed off for testing, confirmed to fix issue.
  • 2012-07-19 18:14 - Positive test results from Ben with the test kernel Stefan built.
  • 2012-07-19 18:40 - Positive test results from Carlos with the test kernel Stefan built. Full QA tests kicked off.
  • 2012-07-19 19:30 - New kernel package uploaded with the fix.
  • 2012-07-19 20:04 - Full QA tests passed (Stefan test kernel).
  • 2012-07-19 20:52 - Kate hands off to Brad to take point on the incident.
  • 2012-07-19 21:24 - Carlos raises an issue he has seen; after discussion it turns out not to be a regression, and the kernel is declared good. The issue was an error in one of the tests, tracked in https://bugs.launchpad.net/qa-regression-testing/+bug/1026853

  • 2012-07-20 03:17 - Bug is opened for the lucid backport kernel: https://bugs.launchpad.net/kernel-sru-workflow/promote-to-proposed/+bug/1026884
  • 2012-07-20 03:18 - Brad indicates in the tracking bug that the new kernel is ready to be copied to -proposed.

  • 2012-07-20 16:40 - Colin processes https://launchpad.net/ubuntu/+source/linux/3.0.0-23.39 into -proposed in response to ping from Brad in #ubuntu-release.

  • 2012-07-20 16:52 - Brad requests explicit testing by Carlos and Ben be done before passing fix through to -updates.
  • 2012-07-20 17:02 - Pete, Leann and Antonio confirm that once the oneiric kernel is tested it should be released on Friday.

  • 2012-07-20 17:25 - Ben confirms new oneiric kernel in -proposed works in #ubuntu-release
  • 2012-07-20 18:05 - Carlos starts to test the new oneiric kernel in -proposed
  • 2012-07-20 18:17 - Adam cleans up tracking bug task to indicate kernel had been copied to -proposed.
  • 2012-07-20 21:20 - Carlos signs off on the tracking bug.
  • 2012-07-20 21:55 - Adam copies oneiric kernel to -updates.
  • 2012-07-21 00:02 - Adam mentions to Kate and Carlos presence of the other tracking bug for lts-backport-oneiric.
  • 2012-07-21 00:24 - Carlos starts testing of backport kernel in bug 1026884
  • 2012-07-21 01:48 - Carlos finishes testing and signs off on tracking bug 1026884
  • 2012-07-21 03:02 - Adam releases lts-backport-oneiric kernel to lucid-updates

Successes

Problems

  • QA tests do not verify that they are testing the intended configuration. A test that was meant to test i386 was in fact running on an amd64 kernel.
  • No updates to the incident report timeline between 2012-07-19 21:00 and 2012-07-20 17:00: the tracking bug wasn't recorded, there were no relevant updates in the #ubuntu-kernel or #ubuntu-release IRC channels, and status was unclear on Friday morning.
  • The backport kernel was overlooked for testing.

Recommendations

  • Every QA kernel test should check that it is running on the correct hardware. If the name of the job is supposed to have the arch type in it, that should be parsed and compared with the arch of the kernel installed on the test system. The same verification should happen for any other parameters that can be verified.
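
The arch check recommended above could look roughly like the following. This is only a minimal sketch, not the QA team's actual harness: the job name format, the `ARCH_ALIASES` table, and the function names `arch_from_job_name` and `check_arch` are all assumptions made for illustration. It relies on `platform.machine()`, which reports the machine type of the running kernel (the same value as `uname -m`).

```python
import platform
import re

# Hypothetical mapping from the arch token in a job name to the machine
# strings the running kernel may report. Tokens and aliases are assumptions.
ARCH_ALIASES = {
    "i386": {"i386", "i486", "i586", "i686"},
    "amd64": {"x86_64", "amd64"},
}


def arch_from_job_name(job_name):
    """Return the first known arch token in the job name, or None."""
    for token in re.split(r"[^A-Za-z0-9]+", job_name):
        if token.lower() in ARCH_ALIASES:
            return token.lower()
    return None


def check_arch(job_name, machine=None):
    """Raise RuntimeError unless the kernel arch matches the job name.

    `machine` defaults to the arch of the running kernel; passing it
    explicitly is only meant for exercising the check itself.
    """
    expected = arch_from_job_name(job_name)
    if expected is None:
        raise RuntimeError("no known arch in job name %r" % job_name)
    actual = machine if machine is not None else platform.machine()
    if actual.lower() not in ARCH_ALIASES[expected]:
        raise RuntimeError(
            "job %r expects %s but kernel reports %s"
            % (job_name, expected, actual)
        )
    return expected
```

A test job would call `check_arch` with its own name as the first step and abort on failure, so that an i386 job landing on an amd64 kernel fails loudly instead of reporting a pass for the wrong configuration.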

IncidentReports/2012-07-19-oneiric-kernel-regression-kills-EC2-instances (last edited 2012-07-21 06:37:13 by adconrad)