AutopkgtestInfrastructure

This describes the machinery we use to run autopkgtests for [[ProposedMigration|gating uploaded packages into the development series]].

<<TableOfContents()>>

= Architecture Overview =

{{attachment:autopkgtest-cloud-architecture.svg||width=100%}}

([[attachment:autopkgtest-cloud-architecture.dia|Dia source]])

= Swift result store and layout =

The Swift object store is used as the central API for storing and querying results. This ensures that logs are kept safe in redundant, non-SPOF storage and that no primary data is kept in any cloud instance; thus the whole system can be completely re-deployed (or any part of it can fail fatally) without losing test results and logs. Swift also provides a [[http://developer.openstack.org/api-ref-objectstorage-v1.html#storage_container_services|flexible API for querying particular results]] so that consumers (like web interfaces for result browsing, report builders, or proposed-migration) can easily find results based on release, architecture, package name, and/or time stamp. For this purpose the containers are all publicly readable and browsable, so that no credentials are needed.

Logs and artifacts are stored in one container `adt-`''release'' per release, as we want to keep the logs throughout the lifetime of a release, and this makes it easy to remove them after EOL. In order to allow efficient querying and polling for new results, the logs are stored in this (pseudo-)directory structure:

  /release/architecture/prefix/sourcepkg/YYYYMMDD_HHMMSS@/autopkgtest_output_files

"prefix" is the first letter (or first four letters if it starts with "lib") of the source package name, as usual for Debian-style archives. Example: `/trusty/amd64/libp/libpng/20140321_130412@/log.gz`

The '`@`' character is a convenient separator for use with a container query's `delimiter=@` option: with that you can list all the test runs without getting the individual files for each run.
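
For illustration, the public container can be queried directly over HTTP with Swift's standard `prefix` and `delimiter` parameters. A minimal sketch (`$SWIFT_URL` is a placeholder for the public storage URL of our object store):

{{{
# list all test runs of libpng on trusty/amd64, without listing the
# individual files of each run (thanks to delimiter=@)
curl "$SWIFT_URL/adt-trusty?prefix=trusty/amd64/libp/libpng/&delimiter=@"

# fetch one particular log directly
curl -O "$SWIFT_URL/adt-trusty/trusty/amd64/libp/libpng/20140321_130412@/log.gz"
}}}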

The result files are by and large the contents of autopkgtest's `--output-directory` plus an extra file `exitcode` with `adt-run`'s exit code; these files are grouped and tar'ed/compressed:

 * `result.tar` contains the minimum files/information which clients like proposed-migration or debci need to enumerate test runs and see their package names/versions/outcome: `exitcode`, `testpkg-version`, `duration`, and `testbed-packages`. All of these are very small (typically ~ 10 kB), thus it's fine to download and cache them all in e. g. the debci frontend for fast access.

 * `log.gz` is the compressed `log` from autopkgtest. Clients don't need to download and parse this, but it's the main thing developers look at, so it should be directly linkable/accessible. These have a proper MIME type and MIME encoding so that they can be viewed inline in a browser.

 * `artifacts.tar.gz` contains ''testname''-`{stdout,stderr,packages}` and any test specific additional artifacts. Like the log, these are not necessary for machine clients making decisions, but should be linked from the web UI and be available to developers.

Due to Swift's "eventual consistency" property, we can't rely on a group of files (like `exitcode` and `testpkg-version`) becoming visible at exactly the same time for a particular client, so we store them together in `result.tar` to achieve atomicity instead of storing them individually.
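
To illustrate how clients consume this, fetching just `result.tar` for a run and reading the small files inside is enough to determine the outcome (again, `$SWIFT_URL` is a placeholder and the run directory is the example from above):

{{{
# download the compact per-run result and inspect the outcome
curl -O "$SWIFT_URL/adt-trusty/trusty/amd64/libp/libpng/20140321_130412@/result.tar"
tar xf result.tar
cat exitcode testpkg-version duration
}}}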

= AMQP queues =

== RabbitMQ server ==

AMQP (we use the [[http://www.rabbitmq.com/|RabbitMQ]] server implementation) provides a very robust and simple-to-use job distribution system, i. e. it coordinates test requests amongst an arbitrary number of workers. We use explicit ACKs, and only ACK a request after it has been fully processed and its logs have been stored in swift. Should a worker or a test run fail anywhere in between and the request not get ACK'ed, it will simply be handed to the next worker. This ensures that we never lose test requests in the event of worker failures.

RabbitMQ provides failover with [[http://www.rabbitmq.com/ha.html|mirrored queues]] to avoid a single point of failure. This is not currently being used, as RabbitMQ is very robust and runs in its own cloud instance (Juju service `rabbitmq-server`).

== Queue structure ==

We want to use a reasonably fine-grained queue structure so that we can support workers that serve only certain releases, architectures, virtualization servers, real hardware, etc. For example: `debci-wily-amd64` or `debci-trusty-armhf`. As test requests are not long-lived objects, we remain flexible here and can introduce further granularity as needed; e. g. we might want a `trusty-amd64-laptop-nvidia` (i. e. running on bare metal without virtualization) queue in the future.

A particular test request (i. e. a queue message) currently just consists of the source package name. Additional fields, such as "PPA name" or perhaps version constraints may be added in the future.
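
For illustration, such a request can be published with any AMQP client; a sketch using `amqp-publish` from the `amqp-tools` package (host and credentials are placeholders; normally britney or `run-autopkgtest` submits these requests):

{{{
# enqueue a test request for "firefox" on wily/amd64
amqp-publish --url=amqp://test_request:PASSWORD@rabbitmq.example.com \
    --routing-key=debci-wily-amd64 --body=firefox
}}}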

== Juju service ==

This uses the [[https://jujucharms.com/rabbitmq-server/|standard charm store RabbitMQ charm]] with some customizations:

 * Remove almighty "guest" user
 * Create a user for test requests with a random password and limited capabilities (nothing other than publishing new messages; see the sketch below); these are the credentials for clients like proposed-migration
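
A restricted user of that kind can be created with standard `rabbitmqctl` commands, roughly like this (user name and password are just examples; the empty ''configure'' and ''read'' permission patterns leave only publishing):

{{{
rabbitmqctl add_user test_request S3KRIT-CHANGEME
# permission order: configure, write, read
rabbitmqctl set_permissions test_request "" ".*" ""
}}}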

As usual with the charm, worker services create a relation to the RabbitMQ service, which creates individual credentials for them.

The `rabbitmq-server` Juju service is exposed on a "public" IP (162.213.33.228), but it is only reachable within the Canonical VPN and is firewalled so that only `snakefruit.canonical.com` (the proposed-migration host running britney) and any external workers can access it.

= Workers =

== worker process and its configuration ==

The [[https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/tree/worker/worker|worker]] script is the main workhorse which consumes one AMQP request at a time, runs `adt-run`, and uploads the results/artifacts into swift. Configuration happens in [[https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/tree/worker/worker.conf|worker.conf]]; the options should be fairly self-explanatory. It currently supports two backends, set by the `backend` option in the `[autopkgtest]` section:

 * `nova` uses the [[http://manpages.ubuntu.com/adt-virt-ssh|adt-virt-ssh]] runner with the [[http://anonscm.debian.org/cgit/autopkgtest/autopkgtest.git/tree/ssh-setup/nova|nova ssh setup script]]. This assumes that the nova credentials are already present in the environment (`$OS_*`). A name pattern for the image to be used and other parameters are set in `worker.conf`. This is the backend that we use for running i386/amd64 tests in the ScalingStack cloud.

 * `lxc` uses the [[http://manpages.ubuntu.com/adt-virt-lxc|adt-virt-lxc]] runner. The name pattern for the container to be used and other parameters are set in `worker.conf`. This is the backend that we currently use for running armhf/ppc64el tests on workers outside of the cloud.
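
For illustration, a single test run with the LXC backend boils down to an `adt-run` invocation roughly like the following (a sketch; the actual command line, including the output directory and further options, is assembled by the `worker` script from `worker.conf`):

{{{
# run libpng's tests in the adt-wily container, with the lxcopts
# from the [lxc] section of worker.conf shown below
adt-run libpng --- adt-virt-lxc --eatmydata --sudo adt-wily
}}}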

== Worker service in the cloud ==

The [[https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/tree/deployment/charms/trusty/autopkgtest-cloud-worker|autopkgtest-cloud-worker]] Juju charm sets up a cloud instance which runs 8 parallel `worker` processes per instance (i. e. a little less than the maximum allowed number of instances). This is done through a meta-upstart job [[https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/tree/deployment/charms/trusty/autopkgtest-cloud-worker/autopkgtest-worker-launch.conf|autopkgtest-worker-launch.conf]] which starts the individual instances of [[https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/tree/deployment/charms/trusty/autopkgtest-cloud-worker/autopkgtest-worker.conf|autopkgtest-worker.conf]]. These will also restart the worker on failure and send a notification email.

The workers use the config file `~/worker.conf` which gets installed by the charm (copied from the master file on `wendigo`). If you change the file, you need to `pkill worker` to restart the worker processes. They will gracefully handle SIGTERM and finish running the current test before they restart.

Note that we currently just use a single cloud instance to control ''all'' parallel `worker` and `adt-run` instances. This is reasonably reliable as on that instance `adt-run` effectively just calls some `nova` commands and copies the results back and forth. The actual tests are executed in ephemeral VMs in ScalingStack.

== External workers ==

These work quite similarly to the cloud ones. You can run one (or several) worker instances with an appropriate `worker.conf` on any host that is allowed to talk to the RabbitMQ service and swift; i. e. this is mostly just an exercise in sending RT tickets to Canonical to open the firewall accordingly. But remember that all workers need to be within the Canonical VPN.

We currently have such a setup on `cyclops-nodeXX` for armhf and on `wolfe-XX` for ppc64el, both of which use the LXC backend. Once ScalingStack supports these architectures, these workers will go away.

= debci results browser =

The [[http://anonscm.debian.org/cgit/collab-maint/debci.git|debci]] project provides all the autopkgtest machinery for Debian, and is deployed at [[http://ci.debian.net/]]. Ubuntu's CI deviates from this (tests are triggered by britney instead of `debci-batch`, and we use swift for the results instead of sending them through AMQP requests), but the result web browser/feed generator can be used more or less unmodified (all necessary changes and tools to support swift artifacts landed upstream).

The [[https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/tree/deployment/charms/trusty/debci-web-swift|debci-web-swift charm]] sets up debci by installing the necessary dependencies, checking out debci's git, and applying the following customizations:

 * Change `/doc` symlink to point to http://packaging.ubuntu.com/html/auto-pkg-test.html which is more appropriate for Ubuntu developers
 * Replace the `debian.png` logo with an Ubuntu logo
 * Install Apache 2 instead of lighttpd, as Apache is Ubuntu's (only) supported web server (in main).
 * Adjust the `public/font-awesome/fonts` symlink to the older target directory of `fonts-awesome` in 14.04.
 * The [[https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/tree/deployment/charms/trusty/debci-web-swift/hooks/start|start]] Juju hook sets up a cron job for retrieving new results from swift (via [[http://anonscm.debian.org/cgit/collab-maint/debci.git/tree/bin/debci-collect-swift|debci-collect-swift]]) for all supported releases and architectures, and applies a workaround for a bug.

The charm does not need any credentials or relations; it is entirely independent of britney, the workers, and all other components.

= Deployment =

== Single-script deployment from wendigo ==

Everything that's necessary to deploy and configure all services into a freshly bootstrapped Juju environment is contained in [[https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/tree/deployment/deploy.sh|deploy.sh]]. It gives a short help when called without arguments, but usually you would call it like this:

{{{
  prod-ues-proposed-migration@wendigo:~$ autopkgtest-cloud/deployment/deploy.sh worker.conf ~/.scalingstack/lcy01 ~/.scalingstack/lgw01
}}}

I. e. you give it the following arguments:

 * The local `worker.conf` file (see below) which gets copied to the worker service. This contains the Swift credentials and Ubuntu-specific configuration such as which releases/architectures to run, which environment-specific proxies to use, etc.
 * A nova RC file for the first cloud to run the actual tests in (should be ScalingStack).
 * Optionally, a second nova RC file; then tests will be run in both clouds. (ScalingStack has two more or less equal regions)

You can also use `deploy.sh` for re-deploying a single service after you `juju destroy-service`'d it.

`deploy.sh` deploys basenode/ksplice/landscape into all instances, deploys the above RabbitMQ, worker, and debci charms, attaches the necessary public IPs, and exposes the services. At the end it prints the credentials to be used by britney (or other entities requesting tests): these can only be used to publish new test requests, not to consume them or do any other queue administration. They need to be copied to `britney.conf` on `snakefruit.canonical.com`.
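
For illustration only, the britney side then looks roughly like the snippet below; the option names are assumptions based on the Ubuntu britney fork's configuration and the values are placeholders, so check `britney.conf` itself rather than relying on this sketch:

{{{
ADT_ENABLE    = yes
ADT_AMQP      = amqp://test_request:PASSWORD@162.213.33.228
ADT_SWIFT_URL = http://<swift-proxy>/v1/AUTH_<tenant-id>
}}}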

== worker.conf ==

This is complete except for the swift password, which you need to copy from `~/.novarc`. Note that the `[amqp]` section should be left empty; it gets filled in automatically by relating the worker services to the RabbitMQ service.

{{{
[amqp]
host =
user =
password =

[swift]
region_name = bootstack-ps45
auth_url = http://10.24.0.132:5000/v2.0/
username = prod-ues-proposed-migration
tenant = prod-ues-proposed-migration_project
password = S3KRIT-CHANGEME

[autopkgtest]
checkout_dir = /home/ubuntu/autopkgtest
releases = precise trusty vivid wily
architectures = i386 amd64
# testbed backwards compat for trusty
setup_command = if grep -q trusty /etc/lsb-release; then apt-get install -y build-essential; fi
big_packages = binutils chromium-browser glibc libreoffice linux python2.7 python3.4 tdb firefox akonadi
long_tests = gutenprint gmp-ecm
backend = nova

[nova]
image_pattern = ubuntu/ubuntu-$RELEASE-.*-$ARCHITECTURE-server
flavor = m1.small
big_flavor = m1.large
novaopts = --keyname=testbed --net-id=net_ues_proposed_migration -e 'http_proxy=http://squid.internal:3128' -e 'https_proxy=http://squid.internal:3128' -e 'no_proxy=127.0.0.1,127.0.1.1,localhost,localdomain,novalocal,internal,archive.ubuntu.com,security.ubuntu.com,ddebs.ubuntu.com' --mirror=http://ftpmaster.internal/ubuntu

[lxc]
container = adt-$RELEASE
lxcopts = --eatmydata --sudo
}}}

== Issues ==

 * The worker charm currently fails to set the correct `--keyname` in `worker.conf`. Set this manually for the time being: `--keyname=testbed-$(hostname)`. ([[https://launchpad.net/bugs/1480962|#1480962]])
 * Rebooting the worker instance stops the relation to RabbitMQ without re-relating at boot. If you need to reboot, manually call `juju add-relation rabbitmq-server autopkgtest-cloud-worker` from `wendigo` afterwards. ([[https://launchpad.net/bugs/1475231|#1475231]])
 * The very first time after ProdStack gets set up, you need to add a firewall rule to allow it to talk to ScalingStack:

  {{{
nova secgroup-add-rule default tcp 22 22 162.213.33.179/32
  }}}

 (This is only relevant if ProdStack ever gets torn down and rebuilt).

= Integration with proposed-migration (britney) =

Debian's britney2 does not integrate with autopkgtests, so [[http://bazaar.launchpad.net/~ubuntu-release/britney/britney2-ubuntu/|Ubuntu's fork]] modifies it to do so. All the logic for determining the set of tests to run for a particular package, submitting the requests, and collecting the results is contained in the new [[http://bazaar.launchpad.net/~ubuntu-release/britney/britney2-ubuntu/view/head:/autopkgtest.py|autopkgtest.py module]]. This is called from [[http://bazaar.launchpad.net/~ubuntu-release/britney/britney2-ubuntu/view/head:/britney.py|britney.py]]'s `write_excuses()` function. Tests for a lot of scenarios and bug reproducers are in [[http://bazaar.launchpad.net/~ubuntu-release/britney/britney2-ubuntu/view/head:/tests/test_autopkgtest.py|tests/test_autopkgtest.py]], which you can run without further setup (it creates a temporary config and archive for every test case).

Interfacing with the cloud happens via AMQP for requesting a test (e. g. sending a message "firefox" to the `debci-trusty-armhf` queue) and by downloading new `result.tar` results from swift on each run. Thus britney directly depends only on the RabbitMQ service and swift, and on no other services in the cloud. Of course there must be some workers somewhere which actually process the requests, otherwise the triggered tests will stay "in progress" forever.

= Administration =

 * Requesting manual runs is done with britney's [[http://bazaar.launchpad.net/~ubuntu-release/britney/britney2-ubuntu/view/head:/run-autopkgtest|run-autopkgtest]] script on snakefruit. Due to firewalling this can currently only be run on snakefruit, so define this shell alias:

 {{{
 alias run-autopkgtest='ssh snakefruit.canonical.com sudo -i -u ubuntu-archive run-autopkgtest'
 }}}

 Then you can run `run-autopkgtest --help` to see the usage. E. g.

 {{{
 # specific architecture
 run-autopkgtest -s wily -a armhf libpng udisks2
 # all configured britney architectures (current default: i386, amd64)
 run-autopkgtest -s wily libpng udisks2
 }}}

 * Show queue lengths, until that [[https://launchpad.net/bugs/1479811|gets shown in debci]]:

 {{{
ssh wendigo.canonical.com sudo -H -u prod-ues-proposed-migration \
    juju ssh rabbitmq-server/0 sudo rabbitmqctl list_queues
 }}}

 * Show currently running tests:

 {{{
ssh wendigo.canonical.com sudo -H -u prod-ues-proposed-migration \
    juju ssh autopkgtest-cloud-worker/0 "'ps ux|grep runner/adt-run'"
 }}}

 * There is a page for current [[http://autopkgtest.ubuntu.com/status/alerts/|temporary testbed failure]] results, which in most cases indicate infrastructure bugs; ideally there should be zero.

 You can also download the JSON data to print commands to re-run all tmpfail tests (you should consolidate them into one command per series/arch):
 {{{
ssh wendigo.canonical.com sudo -H -u prod-ues-proposed-migration \
    juju ssh debci-web-swift/0 grep -l tmpfail 'debci/data/packages/*/*/*/*/latest.json' | \
    { export IFS='/'; while read _ _ _ r a _ p _ ; do echo run-autopkgtest -s $r -a $a $p; done; }
 }}}

 * The [[https://code.launchpad.net/~auto-package-testing-dev/auto-package-testing/trunk|lp:auto-package-testing]] branch has some [[http://bazaar.launchpad.net/~auto-package-testing-dev/auto-package-testing/trunk/files/head:/slave-admin/|scripts in the slave-admin dir]] which help with maintaining the armhf and ppc64el nodes. On these there is a system unit `autopkgtest-lxc-worker.service` which runs the LXC worker. You can see their status and which test they are currently running with:

 {{{
./cmd armhf systemctl status autopkgtest-lxc-worker
./cmd ppc64el systemctl status autopkgtest-lxc-worker
./cmd ppc64el systemctl status autopkgtest-lxc-worker2
 }}}

 (The ppc64el boxes can run two tests in parallel, the cloned `autopkgtest-lxc-worker2.service` does that.)

 `./cmd` is just a thin wrapper around parallel-ssh, which is a convenient way to mass-admin these boxes.
