IncidentReports

This page is based on the existing Canonical-wide procedure for handling incident reports, IncidentReports. It is intended for documenting incidents where the landing process had to be blocked due to regressions (TRAINCON-0).

2014-08-19:

Incident Description

Timeline

Successes

Problems

Recommendations

2014-08-06: worsening test situation on smoketesting, not enough progress on blockers

Incident Description

There was still one day left until a 'normal' TRAINCON-0 would be announced (we were on the 6th day without promotion), but seeing that the test results and blocker fixes were not improving fast enough, and that more and more new issues kept appearing, we decided to switch it on earlier. Another reason was the approaching RTM branching period, set for around Friday, for which we wanted to have a promoted image available.

One of the biggest culprits behind this incident was a failure to run autopilot tests caused by apparmor denials. The apparmor denials were in turn caused by hwclock not setting the date correctly. This led to cache timestamp mismatches and invalid behaviour during testing on various occasions, so we could not get a clean test run for a given image. We also noticed that upstreams themselves did not put enough pressure on fixing the blockers or on avoiding new issues. No promotions were happening, and whenever one blocker was fixed, a new one appeared in its stead. The decision to stop the line was made.

Timeline

2014.08.06: after seeing the smoketesting results and not enough velocity on blocker fixes, TRAINCON-0 is announced

2014.08.07: smoketesting failure issue identified, problems finding the right person responsible, a workaround for it made in the infrastructure by Paul. Most blockers fixed: only date-time picker and camera-app remaining. Workarounds being prepared. Preliminary dogfooding for promotion performed - things looking good.

2014.08.08: the workarounds turned out to have some flaws here and there, so real fixes were preferred. We finally managed to test and land all fixes and build a new image. Late at night (UTC) the new image was dogfooded and promoted to the devel channel. TRAINCON-0 removed.

Successes

This time more involvement could be seen from various upstream developers. Also, thanks to additional care from the Landing Team members in pushing for fixes, things moved forward much faster.

Problems

It would seem that upstream landers do not put enough pressure on the quality of their landings, and those landings introduce additional regressions. Also, everyone seemed to require an additional push from the landing team to move things forward. Another problem noticed (before TRAINCON-0) was that even though the trust-store location-service landing was signed off by QA, it still came with a serious regression. This means that even a double-check from QA does not give a 100% guarantee about the safety of a landing. Worth considering for the RTM period.

Recommendations

Involve engineering managers more in dealing with a TRAINCON-0 crisis, so that they additionally push developers to get the blocker fixes done. Also, even though it was not a problem this time, during TRAINCON-0 we should intentionally decrease the velocity of landings, since even a QA sign-off does not give a 100% guarantee of no regressions.

2014-07-28: no promotion for over 7 days because of various regressions and test issues

Incident Description

After a period of 7 business days without image promotion, we announced TRAINCON-0 as per the rules defined in our landing process. The problem this time was a multitude of issues throughout the week. The first blocking issues were reliably reproducible autopilot test failures in some of our suites, which were very time consuming to fix. After these were fixed, problems making the images unbootable surfaced (changes in user handling). All this, combined with delays caused by chroot archive problems, resulted in us entering TRAINCON-0. The TRAINCON-0 itself lasted 2 days because of new blocking user-visible issues identified in the meantime.

Timeline

2014.07.28: All test failures were fixed after over a week of work, but the image could not be built and tested in time, so TRAINCON-0 was announced. The plan was to promote as soon as possible, but smoketesting and dogfooding revealed new blockers. The first one was widespread mediascanner crashes, leaving no media access. The second was Unity8 not starting on emulator images. Both were fixed.

2014.07.29: With blockers fixed and smaller issues whitelisted, a relatively fair image was built. Smoketesting showed some problems with application launch related to apparmor denials, with Paul and Jamie working on identifying the issue. Dogfooding did not show it to be a real problem (which turned out not to be entirely true), so we decided on promotion. TRAINCON-0 over.

Successes

-

Problems

There were many problems leading up to and during this TRAINCON-0.

First of all, the blocking issues that led to TRAINCON-0 took over a week to fix. High pressure was put on the UITK test problems by both the landing team and the SDK team, who were blocked by those issues even locally, but in the end only a limited number of people were involved in the fixes. The SDK team was very responsive during that time, though. All other autopilot issues were resolved relatively quickly.

Another situation that led to the incident was the explicit override of the CI Train security mechanism that reminds landers to upload the ubuntu-ui-toolkit GLES counterparts. CI Train normally does not allow a project to miss its so-called 'twins' whenever they are needed, but this was overridden and caused issues in the emulator. Some of the smaller things leading to TRAINCON-0 could not have been predicted, like issues after an OTA upgrade following the user handling changes.

After TRAINCON-0 was triggered, some new issues appeared before we could promote. The biggest problem here was the lack of hands to do the work. Only a few people were involved in resolving the new blockers while, as Oliver noted, we would expect all upstreams to work hard on getting TRAINCON-0 lifted. There was a lot of interest from some, but most people just waited for the situation to resolve itself.

The final day also proved problematic, as we currently have only one point of contact whenever apparmor-related issues appear.

Recommendations

As per ogra's proposal, first of all we would need to somehow ensure 'all hands on deck' from upstreams whenever TRAINCON-0 is switched on. The TRAINCON-0 rules need to be clearly defined, covering the actions required from the landing team as well as from the landers and upstreams themselves. Whenever an incident appears, everyone should work hard on getting back to a promotable image as soon as possible. A situation where only one or two people (sometimes from the landing team, not upstream) are working on the blockers is not what we should be aiming for. The general proposal is for every team to delegate one developer per timezone (one for UTC, one for US) as an emergency TRAINCON-0 representative. Those developers would all work together on resolving the issue when an incident happens.

A draft of these rules can be found on the TRAINCON-0 page.

2014-06-30: no promotion for over 7 days because of Mir/Qt regressions

Incident Description

After a period of 7 business days without image promotion, we announced TRAINCON-0 as per the rules defined in our landing process. The main reason for the lack of promotions was regressions caused (mostly) by the Mir 0.3.0 landing. There were a lot of blockers, and every time fixes landed for some of them, QA found new ones that needed attention. Many of the regressions were initially thought to be caused by the Qt 5.3 landing, which landed at almost the same time as Mir. This caused some confusion, since in the end Qt was found innocent in most cases.

Timeline

30.06.2014 UTC Morning: As announced on the 27th, Monday began with the switch to TRAINCON-0 due to the lack of promotions. The images built throughout the weekend had all the regression fixes in them, but we still did not have QA dogfooding results for the last image.

03.07.2014 UTC Evening: After successful dogfooding from Dave, and a +1 from Brendan on the autopilot front, we decided to promote #105 and therefore lift TRAINCON-0.

Successes

Some of the regressions were quickly resolved.

Problems

Landing both Mir and Qt 5.3 at almost the same time caused some confusion, as we did not have reference images for incremental landings of those components, so we were not able to upgrade to, for instance, an image with only Qt upgraded. I (Ɓukasz) was also away when the big landings happened, so coordination was not perfect.

Recommendations

Handle Mir landings with more caution and instruct the Mir landers to double-test everything. Also, make sure that when something big lands, we do not land any other big silo before all packages from the previous landing have moved out of -proposed.
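For illustration only (this is not part of the documented process), a quick migration check before accepting the next big silo could look roughly like the sketch below; the package names and suite are examples, not taken from this report.

{{{
#!/usr/bin/env python
# Hypothetical sketch: check whether packages from the previous landing are
# still sitting in the -proposed pocket before allowing another big silo.
import subprocess

SUITE = "utopic-proposed"                     # assumed development series
PACKAGES = ["mir", "qtbase-opensource-src"]   # example package names

def still_in_proposed(package):
    """Return True if rmadison reports the package in the -proposed suite."""
    output = subprocess.check_output(
        ["rmadison", "-u", "ubuntu", "-s", SUITE, package])
    return bool(output.strip())

blocked = [p for p in PACKAGES if still_in_proposed(p)]
if blocked:
    print("Hold further big landings; still in -proposed: " + ", ".join(blocked))
else:
    print("Previous landing has fully migrated; OK to proceed.")
}}}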

2014-06-02: smoke tests do not pass after split greeter landing

Incident Description

After the complete landing of the big 'split greeter out of unity8' silo, almost all application tests were failing in the smoke testing infrastructure. The original reason for that was the newly added dependency on dbus-x11, which caused overall confusion in DBus. Since this left us basically blind on image status, we decided to enter TRAINCON-0.

We decided not to revert immediately but to try fixing the issue instead. The final deadline for a fix was set to 03.06 at 12:00 UTC.

Timeline

02.06.2014: Problem noticed in smoketesting; the dbus-x11 dependency removed from the new unity8 greeter as a first attempt at fixing the problem.

03.06.2014: The previous solution carried its own regressions, causing all indicators to be invisible on the greeter. The fix was insufficient, as the greeter code relied on dbus-launch. Saviq and ogra fixed the greeter wrapper script by moving the environment setup past session init and starting DBus manually - this way multiple DBus daemons are not spawned. This in turn caused a problem with mtp on mako, making it restart itself infinitely on the greeter, which was also fixed by an mtp upload that prevents mtp from starting in lightdm greeter sessions. Image #63 included that fix, but smoketesting results were still broken. The problem was identified in the smoketesting infrastructure, which was not cleaning up the package environment in between tests; this left the dbus-x11 package installed after the dialer-app test runs. A workaround was introduced, forcing the removal of dbus-x11 on test-suite teardown (see the sketch below). This has been confirmed as working.
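For illustration only (the actual smoketesting harness code is not reproduced here), a teardown workaround of this kind could look roughly like the following hypothetical sketch; the class name is invented and root privileges on the test device are assumed.

{{{
import subprocess
import unittest

class DialerAppTestCase(unittest.TestCase):
    # Hypothetical example suite; names do not come from the real harness.

    def tearDown(self):
        # Ensure dbus-x11 does not leak into the next suite's package
        # environment: remove it if this test run pulled it in.
        subprocess.call(
            ["apt-get", "remove", "--yes", "--purge", "dbus-x11"])
        super(DialerAppTestCase, self).tearDown()
}}}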

Successes

Thanks to the swift reaction of Saviq, ogra and many others, the issue was resolved without reverting the whole landing.

Problems

The biggest problem here was, first of all, the absence of the upstream developer, who was still in transit from the sprint in Malta. Another issue noticed by ogra was that the changelog entries did not document the dependency changes well, so without the author present, finding the commits and the rationale for some of the decisions was harder than it should have been. The test infrastructure also posed additional problems, as with no one having access to the devices (due to travel) the debugging process was much harder.

Recommendations

First: train upstreams to document their changes better in the changelogs.
