2011-02-25-Permissions-build-failures

Owner: Kate Stewart

Incident Description

Packages were built with incorrect permissions. Most failed to build, but a few succeeded despite the error.

Crisis Response Team

  • Colin Watson
  • LaMont Jones

  • Robert Collins
  • William Grant
  • Kate Stewart
  • Jamie Strandboge
  • Pete Graner
  • Steve Magoun
  • Chris Coulson
  • Cody Somerville
  • James Troup
  • James Westby

Events

  • 2011-02-25 16:07 Martin Pitt notices a strange hal build failure, and asks LaMont Jones to investigate. Results are inconclusive, although LaMont suggests possible umask issues.

  • 2011-02-25 17:54 Chris Coulson asks the same question. Results remain inconclusive.
  • 2011-02-25 19:26 Further discussion between Chris Coulson, Ken VanDine, and Colin Watson.

  • 2011-02-25 20:32 Colin Watson diffs build logs and determines that there are no updates to natty which might have been relevant.
  • 2011-02-25 20:36 Colin asks #launchpad-ops for help.
  • 2011-02-25 20:50 Colin realises that some packages are being misbuilt containing files with mode 0600, and rings the IS emergency phone. James Troup answers but is not at a computer, and says he'll track down another sysadmin.
  • 2011-02-25 20:57 Colin escalates up his management chain. Robbie Williamson is on holiday; Rick Spencer answers but is also on holiday, and suggests contacting Elliot Murphy (IS) and Kate Stewart (handover within platform). When Colin gets off the phone, LaMont has already responded, so there was no need to contact Elliot.

  • 2011-02-25 21:10 Colin belatedly disables the Launchpad publisher to try to limit damage.
  • 2011-02-25 21:13 LaMont disables the build daemons.

  • 2011-02-25 21:21 Kate Stewart responds; Colin hands over.
  • 2011-02-25 21:30 Kate has a phone discussion with Rick Spencer (on holiday) and Elliot Murphy (traveling).
  • 2011-02-25 21:45 Rick calls Pete Graner, asking him to get involved and help Kate and the team.
  • 2011-02-25 21:52 Pete sends out an identi.ca message asking people to stop downloading natty builds.
  • 2011-02-25 22:03 Kate asks Jamie Strandboge to join the discussion on #launchpad-ops.
  • 2011-02-25 22:04 LaMont and Robert try to identify the root cause and scope of impact on #launchpad-ops.

  • 2011-02-25 22:06 LaMont confirms that ARM and PPC are not affected; hardy-based buildds are to remain on manual until the root cause is established. Discussion continues with Jamie and Robert to investigate changes.

  • 2011-02-25 22:38 LaMont identifies the sudo upgrade at 11:39 (version 1.7.2p1-1ubuntu5.3, replacing version 1.6.9p10-1ubuntu3.8) as the cause. i386 and amd64 were updated with the new sudo at 14:39 UTC.

  • 2011-02-25 22:40 LaMont proposes a plan: LaMont to revert the sudo-related changes on the buildds; William (wgrant) to find out which build records need to be retried; buildds can go back on auto once sudo is downgraded and sudoers is fixed.
  • 2011-02-25 22:46 Jamie Strandboge confirms the sudo update as the likely cause.

  • 2011-02-25 23:00 William starts to assemble queries to pull information on builds done in the affected period.
  • 2011-02-25 23:14 Initial cut of affected archives (including private ones) determined.
  • 2011-02-25 23:20 Kate contacts Pat McGowan to find a reviewer for the private OEM packages; Steve Magoun is nominated and phoned.

  • 2011-02-25 23:34 Steve joins channel, and starts to review affected archives.
  • 2011-02-25 23:43 Steve identifies private OEM PPAs that should not be made public.
  • 2011-02-25 23:47 Cody notices that there are some private builds for Linaro in the list, and informs LaMont and Kate.

  • 2011-02-25 23:50 LaMont and William put together a draft public list, with owner/archive names and excluding the unaffected architectures, for Kate to review before publishing.

  • 2011-02-26 00:00 Colin rejoins and recommends fixing the bogus successes first: upload them with a high score to force them into the build, then turn the builders back to manual and wait for the publisher, before switching builds back to auto.
  • 2011-02-26 00:04 LaMont recommends notifying PPA owners directly for those special cases.

  • 2011-02-26 00:05 LaMont confirms Colin's recommendation. William to check successes manually, and LaMont to requeue failures.

  • 2011-02-26 00:09 Kate notices on revised public list an entry for "canonical-payment-service", and asks Steve about it.
  • 2011-02-26 00:11 Discussion leads to decision that ISD needs to make the call about public nature of "canonical-payment-service". Decision made (based on time zones, travel plans, and availability) for Kate to contact Ricardo Kirkner.
  • 2011-02-26 00:21 Kate talks to Ricardo, who confirms that "canonical-payment-service" shouldn't be public; Ricardo will join #launchpad-ops to discuss further.
  • 2011-02-26 00:26 LaMont starts Natty uploads to rebuild with William.

  • 2011-02-26 00:38 William gets the list of broken packages.
  • 2011-02-26 00:39 Cody starts a separate incident report to track ISD issues apart from this thread.
  • 2011-02-26 00:45 Steve confirms OEM packages are OK, but is unable to check oem-services-qa PPA since he doesn't have access. He sent a note to cgregan explaining the situation.
  • 2011-02-26 00:50 LaMont reuploads the broken packages and switches the publisher back on.

  • 2011-02-26 01:04 William checked all primary archive builds and the rebuilt broken ones. Now only PPAs remaining.
  • 2011-02-26 01:12 LaMont reports publisher enabled, buildd-manager is running and key packages scored up for rebuilding. Engaging a couple of builders to work through packages.

  • 2011-02-26 01:24 LaMont sets builders back to auto.

  • 2011-02-26 01:30 William will check for PPA failures and email the owners of affected PPAs.
  • 2011-02-26 01:31 William updates launchpadstatus.
  • 2011-02-26 01:45 Kate produces a revised public list with the ISD packages removed, and asks Steve and Ricardo to review it.

  • 2011-02-26 01:52 Kate updates identi.ca with information that builds are now active, and safe to download again.
  • 2011-02-26 01:55 Steve files RT#44274 and RT#44275 to determine who has accessed the ISD PPA.
  • 2011-02-26 02:00 Kate sends an email to ubuntu-devel with a summary of the builder incident and the list of impacted packages that could be made public, which should now be OK.

  • 2011-02-26 03:00-05:00 Kate monitors #ubuntu-devel for questions and issues: all quiet. She also updates this incident report.
  • 2011-02-26 08:26 William completes check of PPA failures, confirming that there was no fallout beyond the primary archive issues.
  • 2011-02-26 10:46 William notices a failed firefox build on artigas, suggesting that it still has the bad umask.
  • 2011-02-26 11:03 William sees a similar problem on crested, with firefox having failed and fpc built again with incorrect modes.
  • 2011-02-26 11:18 William sets all non-virtual builders except for armel and powerpc back to manual, pending analysis of the issues with at least two builders.
  • 2011-02-26 12:09 Kate wakes, checks the backscroll, and sees that the issue persists. She calls Colin and asks him to look into it.
  • 2011-02-26 12:17 Colin appears, disables publisher.
  • 2011-02-26 12:22 Decision made that this new issue doesn't require follow-up on identi.ca.
  • 2011-02-26 12:38 Kate hands off to Colin and goes back to sleep.
  • 2011-02-26 12:58 Colin, having judged that the situation is non-critical and the publisher is not a risk, re-enables the publisher.
  • 2011-02-26 22:27 James appears, begins investigation, discovers that sudo was not downgraded on artigas, crested, sejong, molybdenum, or zirconium. Eventually completes the downgrade.
  • 2011-02-27 00:12 William appears, and it becomes clear that nobody really knows which sudo invocation was actually the problem.
  • 2011-02-27 00:23 Colin suggests forcing the umask to 022 at the top of sbuild.
  • 2011-02-27 00:27 James starts applying the sbuild hack everywhere.
  • 2011-02-27 00:38 James finished hacking umask(022) into sbuild, begins to restart all slaves.
  • 2011-02-27 00:42 James realises that he bounced the slave on ross and adare, which were both in auto and building at the time.
  • 2011-02-27 00:43 William checks and confirms that the builds on both have restarted, but the builders are still working.
  • 2011-02-27 00:45 James turns all builders back onto auto.
  • 2011-02-27 00:48 James completes cleanup of old build directories on ross and adare.
  • 2011-02-27 01:26 William retries the firefox builds and reuploads fpc, the only package that misbuilt again.
  • 2011-02-27 02:19 fpc has built successfully everywhere.
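
The symptom noted at 20:50 (files with mode 0600) and the workaround suggested at 00:23 (forcing the umask to 022) can be illustrated with a minimal sketch. It assumes the misbehaving environment ran with a restrictive umask of 077; the report only says the umask was "bad", so that value is an assumption, as are the function names. It is not the change that was made to sbuild.

```python
import os
import stat
import tempfile


def create_with_umask(directory, name, mask):
    """Create a file under the given umask and return its permission bits."""
    previous = os.umask(mask)  # os.umask returns the previous mask
    try:
        path = os.path.join(directory, name)
        # Request 0o666, as most tools do; the umask clears bits from it.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
        os.close(fd)
    finally:
        os.umask(previous)  # always restore the caller's umask
    return stat.S_IMODE(os.stat(path).st_mode)


def find_unreadable_by_others(root):
    """Return regular files under root that lack the world-read bit."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and not st.st_mode & stat.S_IROTH:
                suspects.append(path)
    return sorted(suspects)


with tempfile.TemporaryDirectory() as tree:
    # 0o077 is the assumed "bad" umask; 0o022 is the value forced in the fix.
    restrictive = create_with_umask(tree, "bad.txt", 0o077)
    normal = create_with_umask(tree, "good.txt", 0o022)
    suspects = [os.path.basename(p) for p in find_unreadable_by_others(tree)]
    print(oct(restrictive), oct(normal), suspects)
```

On a POSIX system the first file ends up with mode 0600 and the second with 0644. A scan like find_unreadable_by_others over an unpacked package tree is one way such misbuilds could be spotted without diffing build logs by hand.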

Successes

  • Damage contained quickly once recognized.
  • Development community notified promptly.
  • Appropriate escalation of incident and handoff across timezones.
  • Able to get OEM and ISD contacts engaged on short notice.
  • Recognition of Linaro and ISD sensitive information prior to public disclosure of the affected package list.

Problems

  • The initial sudo update of the builders, which triggered this incident, was made without assessing the impact of the package upgrade.
  • IS staff were subsequently unavailable to respond to problems caused by the update.
  • Packages were found in the public list that should have been private.
  • Expertise in working with the publisher needs to be better replicated across timezones.
  • Inadvertent posting of private references in an earlier draft of this report.
  • The initial fix was incomplete, with the problematic package remaining on five builders.
  • Everybody left quickly after the initial fix, so the breakage was not noticed for several hours.
  • Nobody responded to the ubuntu-devel thread when the new breakage was discovered.
  • There was a lack of continuity between the first and second fixes, due to a combination of timezones, planes and failed alarms.

Recommendations

  • Review of build machines package upgrade policy.
  • Review of training on what should be public vs. private for ISD group.
  • Review of timezone based skills for dealing with build and publishing infrastructure, and arrange for training to cover any gaps identified.
  • [IS] To address the above (both skills and testing of changes), sbuild should become a proper part of Launchpad and be rolled out by the LOSAs. The current situation, in which most buildd-side code is maintained by LaMont, is suboptimal.

Resolutions

From IS

  • Regarding package upgrade policies: it was agreed that this was a rare and isolated incident. The existing precautions and coordination with Platform (regarding release timing) work well; however, IS needs to be more careful about identifying which types of upgrades are in the sbuild path of execution. As we experienced, 'sudo' is not innocent. IS is now grouping sudo with apt, dpkg, and others as potentially harmful.
  • IS should not have been performing these upgrades on a Friday afternoon, regardless of perceived impact. Going forward IS will not make configuration changes on the buildds on Friday afternoons, except in extreme circumstances (security updates, for example).
  • IS should especially not have been doing this while everyone was in the same time zone (at a Sprint) and about to leave the following day by plane. IS will avoid such situations in the future.
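
The resolutions above could be turned into a pre-change check. The following is a hypothetical sketch only, not an IS tool: the function names are invented, "afternoon" is assumed to mean 12:00 onwards, and the risky set contains only the three packages the report names (the "others" are not listed, so none are guessed).

```python
from datetime import datetime

# Packages named in the report as potentially harmful to the build path.
# The report says "and others" without listing them, so none are added here.
RISKY_PACKAGES = frozenset({"sudo", "apt", "dpkg"})


def risky_upgrades(pending):
    """Return the pending package names that are in the risky set."""
    return sorted(set(pending) & RISKY_PACKAGES)


def is_friday_afternoon(when):
    """Monday is 0, so Friday is 4; noon onwards is an assumed cut-off."""
    return when.weekday() == 4 and when.hour >= 12


def review_change(pending, when, security_update=False):
    """Collect reasons why a buildd change should get extra scrutiny."""
    reasons = []
    flagged = risky_upgrades(pending)
    if flagged:
        reasons.append("touches build path: " + ", ".join(flagged))
    # Security updates are the exception the resolution allows for.
    if is_friday_afternoon(when) and not security_update:
        reasons.append("Friday afternoon change")
    return reasons


# The upgrade from this incident: sudo on Friday 2011-02-25 at 14:39.
print(review_change(["sudo", "vim"], datetime(2011, 2, 25, 14, 39)))
```

An empty result means neither rule objects; any entries are prompts for a human decision rather than a hard block.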

IncidentReports/2011-02-25-Permissions-build-failures (last edited 2011-03-23 17:36:00 by cschluti)