AutopkgtestInfrastructure

This describes the machinery we use to run autopkgtests for gating uploaded packages into the development series.

Architecture Overview

Architecture diagram: autopkgtest-cloud-architecture.svg (Dia source)

Test result store

Swift

The Swift object store is used as the central API for storing and querying results. This keeps logs in redundant, non-SPOF storage and avoids keeping any primary data on a cloud instance, so we can completely re-deploy the whole system (or any part of it can fatally fail) without losing test results and logs. Swift also provides a flexible API for querying particular results, so that consumers (such as web interfaces for result browsing, report builders, or proposed-migration) can easily find results by release, architecture, package name, and/or time stamp. For this purpose the containers are all publicly readable and browsable, so no credentials are needed.

Container Names

Logs and artifacts are stored in one container per release, named autopkgtest-release, as we want to keep the logs throughout the lifetime of a release; this also makes it easy to remove them wholesale after EOL. Results for PPAs are stored in the container autopkgtest-release-lpuser-ppaname (e. g. autopkgtest-wily-pitti-systemd).

Container Layout

In order to allow efficient querying and polling for new results, the logs are stored in this (pseudo-)directory structure:

  • /release/architecture/prefix/sourcepkg/YYYYMMDD_HHMMSS@/autopkgtest_output_files

"prefix" is the first letter (or first four letters if it starts with "lib") of the source package name, as usual for Debian-style archives. Example: /trusty/amd64/libp/libpng/20140321_130412@/log.gz

The '@' character is a convenient separator for use with a container query's delimiter=@ option: with that you can list all the test runs without getting the individual files for each run.

The result files are by and large the contents of autopkgtest's --output-directory plus an extra file exitcode with adt-run's exit code; these files are grouped and tar'ed/compressed:

  • result.tar contains the minimum files/information which clients like proposed-migration or debci need to enumerate test runs and see their package names/versions/outcome: exitcode, testpkg-version, duration, and testbed-packages. All of these are very small (typically ~ 10 kB), thus it's fine to download and cache them all in e. g. the debci frontend for fast access.

  • log.gz is the compressed log from autopkgtest. Clients don't need to download and parse this, but it's the main thing developers look at, so it should be directly linkable/accessible. These have a proper MIME type and MIME encoding so that they can be viewed inline in a browser.

  • artifacts.tar.gz contains testname-{stdout,stderr,packages} and any test specific additional artifacts. Like the log, these are not necessary for machine clients making decisions, but should be linked from the web UI and be available to developers.

Due to Swift's "eventual consistency" property, we cannot rely on a group of individually stored files (like exitcode and testpkg-version) becoming visible at exactly the same time for a particular client, so we store them together in result.tar to achieve atomicity instead of storing them individually.
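
As a rough illustration, a machine client could fetch and inspect one run's result.tar like this (a sketch only: the container name, the example run path, and the exact member paths inside the tar are assumptions based on the layout described above):

  # Sketch: download one run's result.tar from Swift and print its small result files
  import io, tarfile, urllib.request

  SWIFT = 'https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac'
  run = '/autopkgtest-trusty/trusty/amd64/libp/libpng/20140321_130412@'  # example run

  data = urllib.request.urlopen(SWIFT + run + '/result.tar').read()
  with tarfile.open(fileobj=io.BytesIO(data)) as tar:
      for member in tar.getmembers():
          name = member.name.lstrip('./')
          if name in ('exitcode', 'testpkg-version', 'duration'):
              print(name + ':', tar.extractfile(member).read().decode().strip())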

Example queries

Please refer to the Swift container API documentation for the precise meaning of the listing parameters (such as prefix and delimiter). The current public Swift URL for the production infrastructure is

  • https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac
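
A hedged sketch of what such listing queries could look like (the container, package, and architecture names here are just examples; prefix and delimiter are standard Swift container listing options):

  # Sketch: list test runs via Swift's public container listing API
  import urllib.parse, urllib.request

  SWIFT = 'https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac'

  def list_container(container, **params):
      url = SWIFT + '/' + container + '?' + urllib.parse.urlencode(params)
      return urllib.request.urlopen(url).read().decode().splitlines()

  # all runs (not the individual files) of libpng on trusty/amd64, using the '@' delimiter
  print(list_container('autopkgtest-trusty', prefix='trusty/amd64/libp/libpng/', delimiter='@'))

  # all result objects for systemd on wily/amd64 in the pitti/systemd PPA container
  print(list_container('autopkgtest-wily-pitti-systemd', prefix='wily/amd64/s/systemd/'))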

AMQP queues

RabbitMQ server

AMQP (we use the RabbitMQ server implementation) provides a very robust and simple-to-use job distribution system, i. e. it coordinates running test requests amongst an arbitrary number of workers. We use explicit ACKs, and ACK only after a test request has been fully processed and its logs stored in Swift. Should a worker or a test run fail anywhere in between and the request not get ACK'ed, it will just be handed to the next worker. This ensures that we never lose test requests in the event of worker failures.
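
A minimal sketch of this consume-and-ACK pattern (using the pika AMQP client purely for illustration; the host name and the helper function are placeholders, not taken from the real worker code):

  # Sketch: take one request at a time, ACK only once results are safely in Swift
  import pika

  def on_request(channel, method, properties, body):
      try:
          run_test_and_upload_to_swift(body)   # hypothetical helper: run adt-run, upload results
          channel.basic_ack(delivery_tag=method.delivery_tag)
      except Exception:
          # no ACK: the request stays in the queue and is handed to another worker
          channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

  connection = pika.BlockingConnection(pika.ConnectionParameters('rabbitmq.internal'))  # placeholder host
  channel = connection.channel()
  channel.basic_qos(prefetch_count=1)          # one outstanding request per worker
  channel.basic_consume(queue='debci-wily-amd64', on_message_callback=on_request, auto_ack=False)
  channel.start_consuming()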

RabbitMQ provides failover with mirrored queues to avoid a single point of failure. This is not currently being used, as RabbitMQ is very robust and runs in its own cloud instance (Juju service rabbitmq-server).

Queue structure

We want to use a reasonably fine-grained queue structure so that we can support workers that serve only certain releases, architectures, virtualization servers, real hardware, etc. For example: debci-wily-amd64 or debci-trusty-armhf. As test requests are not long-lived objects, we remain flexible here and can introduce further granularity as needed; e. g. we might want a trusty-amd64-laptop-nvidia (i. e. running on bare metal without virtualization) queue in the future.

Test request format

A particular test request (i. e. a queue message) has the format srcpkgname <parameter JSON>.

The following parameters are currently supported:

  • triggers: List of trigsrcpkgname/version strings of packages which caused srcpkgname to run (i. e. triggered the srcpkgname test). Ubuntu test requests issued by proposed-migration should always contain this, so that a particular test run for srcpkgname can be mapped to a new version of trigsrcpkgname in -proposed. In case multiple reverse dependencies trigsrc1 and trigsrc2 of srcpkgname get uploaded to -proposed around the same time, the trigger list can contain multiple entries.

  • ppas: List of PPA specification strings lpuser/ppaname. When given, ask Launchpad for the PPAs' GPG fingerprints and add setup commands to install the GPG keys and PPA apt sources. In this case the result is put into the container "autopkgtest-release-lpuser-ppaname" for the last entry in the list; this is fine-grained enough for easy lifecycle management (e. g. removing results for old releases wholesale) and still predictable enough for the caller to poll for results.

Examples:

  • A typical request issued by proposed-migration when a new glib2.0 2.20-1 is uploaded and we want to test one of its reverse dependencies gedit:

    • gedit {"triggers": ["glib2.0/2.20-1"]}

  • Run the systemd package tests against the packages in the pitti/systemd PPA:

    • systemd {"ppas": ["pitti/systemd"]}

  • Run the unity8 package tests against the packages in the stable phone overlay PPA and the ci-train-ppa-service/landing-001 silo PPA:

    • unity8 {"ppas": ["ci-train-ppa-service/stable-phone-overlay", "ci-train-ppa-service/landing-001"]}

Juju service

This uses the standard charm store RabbitMQ charm with some customizations:

  • Remove almighty "guest" user
  • Create a user for test requests with a random password and limited capabilities (nothing other than publishing new messages); these are the credentials for clients like proposed-migration

As usual with the charm, worker services create a relation to the RabbitMQ service, which creates individual credentials for them.

The rabbitmq-server Juju service is exposed on a "public" IP (162.213.33.228), but it is only reachable from within the Canonical VPN and is firewalled so that only snakefruit.canonical.com (the proposed-migration host running britney) and any external workers can access it.

Workers

worker process and its configuration

The worker script is the main workhorse which consumes one AMQP request at a time, runs adt-run, and uploads the results/artifacts into swift. Configuration happens in worker.conf; the options should be fairly self-explanatory. It currently supports two backends, set by the backend option in the [autopkgtest] section:

  • nova uses the adt-virt-ssh runner with the nova ssh setup script. This assumes that the nova credentials are already present in the environment ($OS_*). A name pattern for the image to be used and other parameters are set in worker.conf. This is the backend that we use for running i386/amd64/ppc64el tests in the Canonical ScalingStack cloud.

  • lxc uses the adt-virt-lxc runner. The name pattern for the container to be used and other parameters are set in worker.conf. This is the backend that we currently use for running armhf tests on workers outside of the cloud.

Worker service in the cloud

The autopkgtest-cloud-worker Juju charm sets up a cloud instance which runs 8 parallel worker instances for each cloud instance (i. e. a little less than the maximum allowed number of test instances). This is done through a meta-upstart job autopkgtest-worker-launch.conf which starts several instances of autopkgtest-worker.conf. These will also restart a worker on failure and send a notification email.

The workers use the config files in worker-config-production/*.conf. Macros like #SWIFT_PASSWORD# are filled in by the autopkgtest-worker-launch.conf upstart job. If you change the configs, you need to pkill -HUP worker to restart the worker processes. The workers handle this signal gracefully and finish running the current test before they restart.

Note that we currently just use a single cloud instance to control all parallel worker and adt-run instances. This is reasonably reliable as on that instance adt-run effectively just calls some nova commands and copies the results back and forth. The actual tests are executed in ephemeral VMs in ScalingStack.

External workers

These work quite similarly to the cloud ones. You can run one (or several) worker instances with an appropriate worker.conf on any host that is allowed to talk to the RabbitMQ service and swift; i. e. this is mostly just an exercise in sending RT tickets to Canonical to open the firewall accordingly. But remember that all workers need to be within the Canonical VPN.

We currently have such a setup on the cyclops-nodeXX machines for armhf, which use the LXC backend. Once ScalingStack supports this architecture, these will go away.

debci results browser

The debci project provides all the autopkgtest machinery for Debian, and is deployed at http://ci.debian.net/. Ubuntu's CI deviates from this (tests are triggered by britney instead of debci-batch, and we use swift for the results instead of sending them through AMQP requests), but the result web browser/feed generator can be used more or less unmodified (all necessary changes and tools to support swift artifacts landed upstream).

The debci-web-swift charm sets up debci by installing the necessary dependencies, checking out debci's git, and applying the following customizations:

  • Change /doc symlink to point to http://packaging.ubuntu.com/html/auto-pkg-test.html which is more appropriate for Ubuntu developers

  • Replace the debian.png logo with an Ubuntu logo

  • Install Apache 2 instead of lighttpd, as Apache is Ubuntu's (only) supported web server (in main).
  • Adjust the public/font-awesome/fonts symlink to the older target directory of fonts-awesome in 14.04.

  • The start Juju hook sets up a cron job for retrieving new results from swift (via debci-collect-swift) for all supported releases and architectures, and applies a workaround for a bug.

The charm has a relation to RabbitMQ for listening to the status/logtail fanout AMQP queue and for introspecting the test request queues, to produce the Currently running tests page. Aside from that it's entirely independent from britney, the workers, and all other components.

Deployment

Single-script deployment from wendigo

Everything that is necessary to deploy and configure all services into a freshly bootstrapped Juju environment is contained in deploy.sh. It gives a short help when called without arguments, but usually you would call it like this:

  prod-ues-proposed-migration@wendigo:~$ autopkgtest-cloud/deployment/deploy.sh ~/.scalingstack/

I. e. you give it the directory containing all the nova RC files that you want to use for running actual tests (these should be the various ScalingStack regions). Note that their names must end in .rc.

You can also use deploy.sh for re-deploying a single service after you juju destroy-service'd it.

deploy.sh deploys basenode/ksplice/landscape into all instances, deploys the above RabbitMQ, worker, and debci charms, and does the necessary public IP attachments and exposes. At the end it prints the credentials to be used by britney (or other entities requesting tests): these credentials can only be used to publish new test requests, not to consume them or do any other queue administration. They need to be copied to britney.conf on snakefruit.canonical.com.

Issues

  • Rebooting the worker instance stops the relation to RabbitMQ without re-relating at boot. If you need to reboot, manually call juju add-relation rabbitmq-server autopkgtest-cloud-worker from wendigo afterwards. (#1475231)

  • The first time after ScalingStacks get set up, you need to add a firewall rule to allow ssh access from ProdStack:

    • nova secgroup-add-rule default tcp 22 22 162.213.33.179/32

    Run this on every ScalingStack region you are going to use (lcy01, lgw01, bos01).

Integration with proposed-migration (britney)

Debian's britney2 does not integrate with autopkgtests, so Ubuntu's fork modifies it to do so. All the logic for determining the set of tests to run for a particular package, submitting the requests, and collecting the results is contained in the new autopkgtest.py module. This is called from britney.py's write_excuses() function. Tests for a lot of scenarios and bug reproducers are in tests/test_autopkgtest.py, which you can just run without further setup (it creates a temporary config and archive for every test case).

Interfacing with the cloud happens via AMQP for requesting a test (e. g. sending a message "firefox" to the debci-trusty-armhf queue) and by downloading new result.tar results from Swift on each run. Thus britney directly depends only on the RabbitMQ service and Swift, not on any other service in the cloud. Of course there must be some workers somewhere which actually process the requests, otherwise the triggered tests will stay "in progress" forever.
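
As a rough sketch of the request side (pika is used here only for illustration, publishing via the default exchange is an assumption of this sketch, and the credentials/host are placeholders):

  # Sketch: publish a test request message in the "srcpkgname <parameter JSON>" format
  import json, pika

  params = pika.URLParameters('amqp://britney:PASSWORD@rabbitmq.internal')  # placeholder credentials/host
  connection = pika.BlockingConnection(params)
  channel = connection.channel()

  # e. g. gedit triggered by a new glib2.0 upload, sent to the wily/amd64 queue
  body = 'gedit ' + json.dumps({'triggers': ['glib2.0/2.20-1']})
  channel.basic_publish(exchange='', routing_key='debci-wily-amd64', body=body)
  connection.close()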

Administration

Show current tests/requests

http://autopkgtest.ubuntu.com → Running shows the currently running and queued tests. Alternatively, you can use some shell commands:

  • Show queue lengths:
    ssh wendigo.canonical.com sudo -H -u prod-ues-proposed-migration \
        juju ssh rabbitmq-server/0 sudo rabbitmqctl list_queues
  • Show currently running tests:
    ssh wendigo.canonical.com sudo -H -u prod-ues-proposed-migration \
        juju ssh autopkgtest-cloud-worker/0 "'ps ux|grep runner/adt-run'"

Re-running tests

  • Requesting individual manual runs is done with britney's run-autopkgtest script on snakefruit. Due to firewalling this currently can only be run on snakefruit, so define this shell alias:

     alias run-autopkgtest='ssh snakefruit.canonical.com sudo -i -u ubuntu-archive run-autopkgtest'

    Then you can run run-autopkgtest --help to see the usage. E. g.

     # specific architecture
     run-autopkgtest -s wily -a armhf --trigger glib2.0/2.46.1-2 libpng udisks2
     # all configured britney architectures (current default: i386, amd64, ppc64el, armhf)
     run-autopkgtest -s wily --trigger glibc/2.21-0ubuntu4 libpng udisks2
    Note that you should always submit a correct "trigger", i. e. the package/version on excuses.html that caused this test to run. This is necessary so that britney can correctly map results to requests, and (soon) because we will only use packages from -proposed for the trigger.
  • lp:ubuntu-archive-tools contains a script retry-autopkgtest-regressions which builds a series of run-autopkgtest commands for re-running all current regressions. It is recommended to direct its output into a file, perhaps edit it, and run the commands on snakefruit (as that is much faster than going through the above ssh wrapper).

  • There is a page for current temporary testbed failure results, which in most cases indicate infrastructure bugs; ideally there should be zero. The workers are designed to automatically kill themselves if they fail to create a testbed three times in a row (e. g. when hitting the cloud quota limit or when the cloud is broken), and to leave the test request in the queue for another worker to process.

    In order to mass-retry tmpfails which britney is waiting on, remove proposed-migration/data/release-proposed/autopkgtest/pending.txt; on the next run britney will then re-request these tests with correct triggers. WARNING: tmpfails indicate a bug in the worker, so don't do this unless you have carefully checked and understood what caused them and fixed the bug. Please talk to Martin Pitt.

Worker administration

  • Autopkgtest controller access: Most workers (for i386, amd64, ppc64el) are running in a ProdStack instance of juju service autopkgtest-cloud-worker/0:

    ssh -t wendigo.canonical.com sudo -H -u prod-ues-proposed-migration juju ssh autopkgtest-cloud-worker/0
    Consider defining a shell alias for this for convenience. You can see which workers are running with
    initctl list|grep autopkgtest
  • Rolling out new worker code/config:

    • Adjust the worker-config-production/*.conf configuration files, commit them.

    • Run git pull in the autopkgtest-cloud/ checkout on autopkgtest-cloud-worker/0

    • Run pkill -e -HUP worker. This will signal the workers to finish their currently running test and then cleanly exit; the autopkgtest-worker upstart job will then restart them after a minute.

  • Stopping all workers: For general cloud/worker administration or other debugging you might want to stop all workers. Run pkill -e worker; this signals the workers to finish their currently running test and then cleanly exit; unlike with SIGHUP, the upstart job will not auto-restart them. If you want or need to stop all workers immediately and thus kill running tests (which will make them appear as failures; you need to retry them later!), run sudo initctl stop autopkgtest-worker-launch instead.

  • External LXC workers: The lp:auto-package-testing branch has some scripts in the slave-admin dir which help with maintaining the external servers that run LXC autopkgtests. On these machines there are systemd units autopkgtest-lxc-worker.service and autopkgtest-lxc-worker2.service which run the LXC workers. You can see their status and which test they are currently running with:

    ./cmd armhf systemctl status autopkgtest-lxc-worker autopkgtest-lxc-worker2

    ./cmd is just a thin wrapper around parallel-ssh, which is a convenient way to mass-admin these boxes.