2011-02-25-Permissions-build-failures

Differences between revisions 12 and 13
Revision 12 as of 2011-02-26 06:50:00
Size: 8
Editor: 99-191-111-134
Comment:
Revision 13 as of 2011-02-26 06:54:51
Size: 7502
Editor: 99-191-111-134
Comment:
<<TableOfContents>>

Owner:

=== Incident Description ===

Packages were built with incorrect permissions. Most failed to build, but a few succeeded despite the error.
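
The events below mention possible umask issues and files built with mode 0600. As a minimal illustration of how those two relate (not a reconstruction of what happened on the build daemons), the Python sketch below creates a file under two umask values and reports the resulting permission bits. The helper name `mode_of_new_file` is hypothetical, and the sketch assumes a POSIX system.

```python
import os
import stat
import tempfile


def mode_of_new_file(umask: int) -> int:
    """Create a file requesting mode 0666 under `umask`; return its permission bits."""
    previous = os.umask(umask)  # os.umask returns the old value so it can be restored
    try:
        with tempfile.TemporaryDirectory() as workdir:
            path = os.path.join(workdir, "example")
            # Ask for rw-rw-rw-; the kernel clears whatever bits the umask masks out.
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(previous)


if __name__ == "__main__":
    for umask in (0o022, 0o077):
        print(f"umask {umask:03o} -> mode {mode_of_new_file(umask):04o}")
```

With a umask of 022 the new file ends up with mode 0644; with a umask of 077 it ends up with mode 0600, the mode seen in the misbuilt packages.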

=== Crisis Response Team ===

 * Colin Watson
 * La``Mont Jones
 * Robert Collins
 * William Grant
 * Kate Stewart
 * Jamie Strand
 * Pete Graner
 * Steve Magoun
 * Chris Coulson
 * Cody Somerville

=== Events ===

 * 2011-02-25 16:07 Martin Pitt notices a strange hal build failure, and asks La``Mont Jones to investigate. Results are inconclusive, although La``Mont suggests possible umask issues.
 * 2011-02-25 17:54 Chris Coulson asks the same question. Results remain inconclusive.
 * 2011-02-25 19:26 Further discussion between Chris Coulson, Ken Van``Dine, and Colin Watson.
 * 2011-02-25 20:32 Colin Watson diffs build logs and determines that there are no updates to natty which might have been relevant.
 * 2011-02-25 20:36 Colin asks #launchpad-ops for help.
 * 2011-02-25 20:50 Colin realises that some packages are being misbuilt containing files with mode 0600, and rings the IS emergency phone. James Troup answers but is not at a computer, and says he'll track down another sysadmin.
 * 2011-02-25 20:57 Colin escalates up his management chain. Robbie Williamson is on holiday; Rick Spencer answers but is also on holiday, and suggests contacting Elliot Murphy (IS) and Kate Stewart (handover within platform). When Colin gets off the phone, La``Mont has already responded, so there was no need to contact Elliot.
 * 2011-02-25 21:10 Colin belatedly disables the Launchpad publisher to try to limit damage.
 * 2011-02-25 21:13 La``Mont disables the build daemons.
 * 2011-02-25 21:21 Kate Stewart responds; Colin hands over.
 * 2011-02-25 21:30 Kate has a phone discussion with Rick Spencer (on holiday) and Elliot Murphy (traveling).
 * 2011-02-25 21:45 Rick calls Pete Graner, asking him to get involved and help Kate and company.
 * 2011-02-25 21:52 Pete sends out an identi.ca message asking people to stop downloading natty builds.
 * 2011-02-25 22:03 Kate asks Jamie Strand to join the discussion on #launchpad-ops.
 * 2011-02-25 22:04 La``Mont and Robert work on identifying the root cause and scope of impact on #launchpad-ops.
 * 2011-02-25 22:06 La``Mont confirms that ARM and PPC are not affected; hardy-based buildds are to remain on manual until the root cause is established. Ongoing discussion with Jamie and Robert to investigate changes.
 * 2011-02-25 22:38 La``Mont identifies the upgrade of sudo to version 1.7.2p1-1ubuntu5.3 (which differs from version 1.6.9p10-1ubuntu3.8) at 11:39 as the cause. i386 and amd64 were updated with the new sudo at 14:39 UTC.
 * 2011-02-25 22:40 La``Mont proposes a plan from here: La``Mont to revert the sudo-related changes on the buildds, and William (wgrant) to find out which build records need to be retried; the buildds can go back on auto as soon as sudo is downgraded and sudoers is fixed.
 * 2011-02-25 22:46 Jamie Strand confirms sudo update as likely cause.
 * 2011-02-25 23:00 William starts to assemble queries to pull information on builds done in the affected period.
 * 2011-02-25 23:14 Initial cut at affected archives (including private) determined.
 * 2011-02-25 23:20 Kate contacts Pat McGowan to find a reviewer for the private OEM packages; Steve Magoun is nominated and phoned.
 * 2011-02-25 23:34 Steve joins channel, and starts to review affected archives.
 * 2011-02-25 23:43 Steve identifies private OEM PPAs that should not be made public.
 * 2011-02-25 23:47 Cody notices that there are some private builds for Linaro in the list, and informs La``Mont and Kate.
 * 2011-02-25 23:50 La``Mont and William put together a draft public list, with owner/archive names and excluding the good architectures, for Kate to review before publishing.
 * 2011-02-26 00:00 Colin rejoins and recommends fixing the bogus successes first: upload them with a high score to force them into the build, then turn the builders back to manual and wait for the publisher, before switching builds back to auto.
 * 2011-02-26 00:04 La``Mont recommends notifying PPA owners directly for those special cases.
 * 2011-02-26 00:05 La``Mont confirms Colin's recommendation. William to check successes manually, and La``Mont to requeue failures.
 * 2011-02-26 00:09 Kate notices on revised public list an entry for "canonical-payment-service", and asks Steve about it.
 * 2011-02-26 00:11 Discussion leads to decision that ISD needs to make the call about public nature of "canonical-payment-service". Decision made (based on time zones, travel plans, and availability) for Kate to contact Ricardo Kirkner.
 * 2011-02-26 00:21 Kate talks to Ricardo, who confirms that "canonical-payment-service" shouldn't be public; Ricardo will join #launchpad-ops to discuss further.
 * 2011-02-26 00:26 La``Mont starts Natty uploads to rebuild with William.
 * 2011-02-26 00:38 William gets the list of broken packages.
 * 2011-02-26 00:39 Cody starts a separate incident report to track ISD issues apart from this thread.
 * 2011-02-26 00:45 Steve confirms OEM packages are OK, but is unable to check oem-services-qa PPA since he doesn't have access. He sent a note to cgregan explaining the situation.
 * 2011-02-26 00:50 La``Mont reuploads the broken packages and switches the publisher back on.
 * 2011-02-26 01:04 William has checked all primary archive builds and the rebuilt broken ones. Only PPAs remain.
 * 2011-02-26 01:12 La``Mont reports publisher enabled, buildd-manager is running and key packages scored up for rebuilding. Engaging a couple of builders to work through packages.
 * 2011-02-26 01:24 La``Mont sets builders back to auto.
 * 2011-02-26 01:30 William will check for PPA failures and email the owners of affected PPAs.
 * 2011-02-26 01:31 William updates launchpadstatus.
 * 2011-02-26 01:45 Kate produces a revised public list with ISD packages removed and asks Steve and Ricardo to review the [[http://paste.ubuntu.com/572428/|public list]].
 * 2011-02-26 01:52 Kate updates identi.ca with information that builds are now active, and safe to download again.
 * 2011-02-26 01:55 Steve files RT#44274 and RT#44275 to determine who has accessed that ISD PPA.
 * 2011-02-26 02:00 Kate sends an email to ubuntu-devel with a summary of the builder incident and the [[http://paste.ubuntu.com/572428/|list of impacted packages that could be made public]], which should now be OK.
 * 2011-02-26 03:00-05:00 Kate monitors #ubuntu-devel for questions/issues: all quiet. She also updates this incident report.
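
A check for the misbuilt files described above (mode 0600) could look roughly like the following Python sketch. It assumes a package's contents have already been unpacked into a directory, and flags regular files that are not readable by other users. The function name `find_restricted_files` and the directory layout are hypothetical; this is not the procedure the response team used, and some files are legitimately restricted, so any results would need manual review.

```python
import os
import stat
from typing import Iterator, Tuple


def find_restricted_files(root: str) -> Iterator[Tuple[str, int]]:
    """Yield (path, mode) for regular files under `root` lacking the other-read bit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other special files
            mode = stat.S_IMODE(st.st_mode)
            if not mode & stat.S_IROTH:
                yield path, mode


if __name__ == "__main__":
    import sys

    # Usage: python check_modes.py <unpacked-package-directory>
    for path, mode in find_restricted_files(sys.argv[1]):
        print(f"{mode:04o} {path}")
```

A file with mode 0600 is reported because the other-read bit is unset, while a file with mode 0644 is not.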


=== Successes ===

 * Damage contained quickly once recognized.
 * Development community notified promptly.
 * Appropriate escalation of incident and handoff across timezones.
 * Able to get OEM and ISD contacts engaged on short notice.
 * Recognition of Linaro and ISD sensitive information prior to public disclosure of affected package list.

=== Problems ===

 * The initial sudo update on the builders, which triggered this incident, was made without assessing the impact of the package upgrade.
 * Packages were found in the public list that should have been private.
 * Expertise in working with the publisher needs to be better replicated across timezones.
 * Inadvertent posting of private references in an earlier draft of this report.

=== Recommendations ===

 * Review of the package upgrade policy for build machines.
 * Review of training on what should be public vs. private for ISD group.
 * Review of timezone-based skills for dealing with build and publishing infrastructure, and arrange for training to cover any gaps identified.


IncidentReports/2011-02-25-Permissions-build-failures (last edited 2011-03-23 17:36:00 by pool-108-1-168-116)