QemuKVMMigration

Introduction

Live migration has been available for quite some time now. Despite the complexity of testing it across a variety of releases, machine types, and architectures, live migration continues to get better. Yet it still sometimes causes issues, especially when migrating between different host versions. For example, in the past we ran into issues like bug 1291321 related to that, and even worse, a few of them seem to have resurfaced recently.

The purpose of this documentation is to focus on and hone the quality and user experience of live migration in Ubuntu. There is a good summary of the general steps taken during a migration at migration info, in case one wants a refresher for this discussion (especially the chapters on vmstate, updating devices, and subsections).

Use Cases

Consistency: so far we have only added Distribution specific machine types to the x86 pc-i440fx type, even though most of the changes that caused us to do so actually affected the pc-q35 type as well. The same can be true for non-x86 architectures. This is up for discussion, but I think everything discussed below should apply to all supported major server architectures (amd64/i386, arm64, ppc64el, s390x).

There are two use-cases that drive the need. First, we’d like to support users who have deployed VMs on Ubuntu LTSes to be able to live migrate their VMs to the next LTS. This means that a qemu VM launched with a machine of ‘pc-1.0’ should be the same on LTS and LTS-next. In the past the ‘pc-XX’ upstream types have changed, and there is still no requirement for them not to change between qemu releases (bug 1291321), so we could not rely upon unversioned upstream machine names. Therefore we introduced a downstream release name (bug 1294823). Unlike other downstreams (RH bug 895959), we did so on top of the upstream types. For backward compatibility we are more or less obliged to keep any released type around for as long as it is supported.

On the other hand, Debian and Fedora just ship the machine classes as they are upstream.

We are an outlier in the sense that we keep the upstream types and only add our own ones.

But today's upstream “versioned” machine names should be fairly safe since 2.x (discussion), at least if all devices have made a full transition to vmstate. That is only true, however, if we, as a downstream, add or backport no patches affecting that.

This is not as rare as one might hope. An example of such a change, one that even cuts across all architectures using SCSI, can be seen at qemu files in the patch for CVE 6351.

Without thinking about SRUs yet (see below), the default machine type up to today is:

$MACHINE_ARCH-$QEMU_VERSION-$DISTRO_RELEASE

Each following release will keep the previously defined aliases to the specific types for compatibility. Whoever adds a delta has to make sure not only to add a new type, but also to maintain compatibility for the old type. Once our spun-off effort to test this better is in place, it could hopefully be used to verify that.

In general we want to make it a Distribution specific type on every release. One could instead argue that we should evaluate whether there is a diff that actually causes any divergence from the usual types. But doing so severely increases maintenance effort and skill requirements on the one hand, and on the other hand it makes the type overview very inconsistent, raising questions like “where is the one type missing in between those releases?”.

The second use-case where the distro release machine type helps is when we introduce (SRU) new functions within the same release that require an update to the machine type. An example is ppc64el, where we’re backporting a feature from qemu 2.6 that adds a new hardware device which users need to be available by default when creating a machine. If this feature were added to the pseries-2.5 type, we would have the same issue again: an ‘old’ VM with type ‘pseries-2.5’ would not match an updated qemu where ‘pseries-2.5’ now has a new element. Migration will fail, and even updates might.

This second case drives the need for a “point-release” element in the downstream names. It is not tied to a usual Ubuntu LTS point release, but to anything introducing a delta to machine type / vmstate. Similar to CentOS/RHEL, we want:

$MACHINE_ARCH-$QEMU_VERSION-$DISTRO_RELEASE-${increment for SRU}
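
The naming scheme above can be sketched as a small shell function. This is only an illustration: machine_type_name and its argument order are assumptions made for this example, not something qemu or the Ubuntu packaging provides.

```shell
# Sketch only: machine_type_name is a hypothetical helper, not part of qemu or libvirt.
# Usage: machine_type_name MACHINE QEMU_VERSION DISTRO_RELEASE [INCREMENT]
machine_type_name() {
    name="$1-$2-$3"
    # Append the SRU increment only when one is given and it is greater than zero.
    if [ -n "${4:-}" ] && [ "$4" -gt 0 ]; then
        name="$name-$4"
    fi
    printf '%s\n' "$name"
}

machine_type_name pseries 2.5 xenial      # prints pseries-2.5-xenial
machine_type_name pseries 2.5 xenial 1    # prints pseries-2.5-xenial-1
```

Leaving the increment off for the initial type keeps the names of already released types unchanged.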

There are actually two kinds of SRU/Backports that one has to treat differently here.

One is a feature backport that should be added to an LTS release. Such things are a planned task and should be batched together (as well as possible) to match the sub-releases of an LTS. This will avoid too much proliferation of those subtypes. Of course, if there was no change in a given dot release, there is no need to add a new increment.

The other case is an SRU for a security issue or a severe bug. These are usually unplanned and have to be taken as an emergency measure. In that case users are usually encouraged to restart their workload anyway to pick up the change, just as you know it from e.g. some kernel fixes. VENOM, for example, affected the floppy device; note in particular the resolution details, which indicate the need to run the new binary by stopping/starting or by migration (which invokes the new binary). I think this supports exactly the case here: when fixing a CVE, it is desirable to retain the same machine type to support a no-downtime "restart" of the binary. There is also no need for, nor any good in, keeping the old “broken” machine type around; you don’t want the ability to say “hey, I can still start this with the CVE not fixed”. So for these cases there will be no bump to the machine type. Users will be unable to migrate from an old broken system to a new fixed one, but they are supposed to restart them anyway. This again prevents a proliferation of types, but more importantly it ensures we won’t be forced to keep broken types around if we consider them bad enough that they should go away.

Finally, at some point one has to stop accumulating an ever growing delta. So the idea is to clean out old machine types once they are no longer supported and leave the migration paths roughly matching the supported upgrade paths. That means an LTS unifies former releases and upgrades have to "go through them".

Machine Type handling Summary

Handle machine types by:

  • Add a Distribution release specific suffix to the default type(s) of each major arch; examples with xenial:
    • x86: pc-i440fx-xenial and pc-q35-xenial
    • s390x: s390-ccw-virtio-xenial
    • ppc64el: pseries-xenial
    • arm64: virt-xenial
  • Feature backports will add a -%d to the affected types
    • To avoid a proliferation of those types such changes should be bundled roughly along LTS dot releases.
    • The -%d suffix will not have to match the related dot release it was released with (just an increment)
  • bugfix/security SRUs affecting this will not add an increment
    • They will either not affect it anyway (no-op)
    • Or are so important that users have to restart the guests anyway to pick up the fix
  • Default if no machine type is specified will always point to the latest Distribution specific machine type
  • We are not dropping upstream types, they are provided as-is without further guarantees
    • Cross vendor/downstream migrations might work for upstream types, but are considered not supported
    • This has always been the case, but the package documentation might need to be updated to reflect it.
  • Cleanup matching the usual supported Distribution upgrade paths
    • Drop former non LTS release definitions after next LTS
    • Drop former LTS release definitions when out of support
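
The increment rule from the list above can be sketched as follows. The function name next_machine_type and the change kinds feature, security and bugfix are assumptions made for this illustration, and the sketch expects a Distribution specific type name as input.

```shell
# Sketch only: next_machine_type is a hypothetical helper illustrating the rules above.
# Usage: next_machine_type CURRENT_TYPE feature|security|bugfix
next_machine_type() {
    current="$1"
    change="$2"
    case "$change" in
        feature)
            # Feature backports add a new type with the next -%d increment.
            suffix="${current##*-}"
            case "$suffix" in
                ''|*[!0-9]*) printf '%s-1\n' "$current" ;;
                *) printf '%s-%d\n' "${current%-*}" "$((suffix + 1))" ;;
            esac
            ;;
        security|bugfix)
            # Bugfix/security SRUs keep the machine type as it is.
            printf '%s\n' "$current"
            ;;
        *)
            echo "unknown change kind: $change" >&2
            return 1
            ;;
    esac
}

next_machine_type pc-i440fx-2.5-xenial feature      # prints pc-i440fx-2.5-xenial-1
next_machine_type pc-i440fx-2.5-xenial-1 security   # prints pc-i440fx-2.5-xenial-1
```

The older types stay defined next to the new one; the function only names the type that becomes the new default.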

Example

An example flow through releases and upgrades:

A release that has a machine type / vmstate diff for all x86 based machines, but none for others:

 pc-i440fx-2.5-xenial
 pc-q35-2.5-xenial

Gets a xenial feature backport SRU on LTS dot release, but it only affects q35 based machines

 pc-i440fx-2.5-xenial
 pc-q35-2.5-xenial
 +pc-q35-2.6-xenial-1

Gets an SRU for a CVE; users are supposed to restart to pick up the fix

 <no change>

Gets another Feature backport SRU that affects all types on next dot release

 pc-i440fx-2.5-xenial
 +pc-i440fx-2.5-xenial-1
 pc-q35-2.5-xenial
 pc-q35-2.6-xenial-1
 +pc-q35-2.6-xenial-2

Here in a more visual overview.

Support Matrix

This section tries to list the migration paths that are expected to work. In general those should match the Ubuntu upgrade paths. So an interim release can migrate to the following interim release as long as it is supported. Later on those are unified by the LTS release. An LTS can always migrate to the following LTS. Of course, migrations from a release to "itself" are supported as well.

from v / to >: LTS, Int, Int+1, Int+2, LTS+1, Int+3, Int+4, Int+5, LTS+2

LTS: Y Y Y Y
Int: Y Y
Int+1: Y Y
Int+2: Y Y
LTS+1: Y Y Y
Int+3: Y Y
Int+4: Y Y
Int+5: Y Y
LTS+2: Y

Others might work as well, and in fact quite a few do, but only those listed are officially considered supported.

The same applies when thinking not about live migrations, but about stopping (and maybe moving) a guest and starting it on another host or after an upgrade. But in that case one doesn't have to give up if the upgrade path isn't supported as above. Instead, in most cases you just have to update your guest configuration to lift it to a newer machine type. Linux guests, at least, mostly auto-detect the new features and work fine. A short guide is below in the "upgrade machine type" section.

Backward migrations in particular are not considered supported by upstream. Upstream as well as the packaging of fixes tries to keep things working, but expect any backward migration to be risky. See this bug for a discussion of that case.

Sometimes the initial guest was created on a release that is no longer supported. In that case the administrator has to upgrade the machine type from the now unsupported one to a newer one; see the section below for details.
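
As a rough sketch, the rules stated in the prose above (same release, directly following release, LTS to following LTS) could be checked like this. The RELEASES list and the function names are assumptions made for this example; the sketch encodes only those three prose rules, so the matrix itself stays authoritative.

```shell
# Sketch only: release labels and helper names are hypothetical.
RELEASES="LTS Int Int+1 Int+2 LTS+1 Int+3 Int+4 Int+5 LTS+2"

# Print the 1-based position of release $1 in RELEASES; fail if it is unknown.
release_index() {
    i=0
    for r in $RELEASES; do
        i=$((i + 1))
        if [ "$r" = "$1" ]; then
            echo "$i"
            return 0
        fi
    done
    return 1
}

# Print the first LTS that follows release $1; fail if there is none.
next_lts() {
    seen=0
    for r in $RELEASES; do
        if [ "$seen" -eq 1 ]; then
            case "$r" in LTS*) echo "$r"; return 0 ;; esac
        fi
        if [ "$r" = "$1" ]; then seen=1; fi
    done
    return 1
}

# Return 0 if migrating from release $1 to release $2 follows the prose rules,
# 1 if it does not, and 2 if a release label is unknown.
migration_supported() {
    from_idx=$(release_index "$1") || return 2
    to_idx=$(release_index "$2") || return 2
    # A release can always migrate to itself.
    if [ "$from_idx" -eq "$to_idx" ]; then return 0; fi
    # A release can migrate to the directly following release.
    if [ "$to_idx" -eq $((from_idx + 1)) ]; then return 0; fi
    # An LTS can migrate to the following LTS.
    case "$1" in
        LTS*) if [ "$(next_lts "$1")" = "$2" ]; then return 0; fi ;;
    esac
    return 1
}

if migration_supported LTS LTS+1; then echo "LTS -> LTS+1: supported"; fi
if ! migration_supported Int LTS; then echo "Int -> LTS: not supported"; fi
```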

Upgrade machine type

You might want to update the machine type of an existing defined guest to:

  • pick up the latest security fixes and features
  • continue using a guest created on a now unsupported release

In general it is recommended to update machine types when upgrading qemu/kvm to a new major version. But this can likely never be an automated task, as the change is guest visible. The guest devices might change in appearance, new features will be announced to the guest, and so on. Linux is usually very good at tolerating such changes, but it depends so much on the setup and workload of the guest that this has to be evaluated by the owner/admin of the system. Other operating systems were known to often be severely impacted by changes to the HW. Consider a machine type change similar to replacing all devices and firmware of a physical machine with the latest revision: all considerations that apply there apply to evaluating a machine type upgrade as well.

As usual with major configuration changes, it is wise to back up your guest definition and disk state to be able to do a rollback just in case. There is no integrated single command to update the machine type via virsh or similar tools. It is a normal part of your machine definition and is therefore updated the same way as most other settings.

There is now also the experimental tool virt-machine-type to help with upgrading machine types. It is released as a snap, so you can get it as easily as:

$ snap install virt-machine-type --edge

If you want to do it the manual/old way, first shut down your machine and wait until it has reached that state.

   virsh shutdown <yourmachine>
   # wait for the guest to power off
   virsh list --inactive
   # should now list your machine as "shut off"
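
The waiting step can also be scripted. The following is a minimal sketch that polls virsh domstate until the guest reports "shut off"; the function name wait_for_shutoff and the default timeout of 120 seconds are assumptions made for this example.

```shell
# Sketch only: wait_for_shutoff is a hypothetical helper around "virsh domstate".
# Usage: wait_for_shutoff GUEST [TIMEOUT_SECONDS]
wait_for_shutoff() {
    guest="$1"
    timeout="${2:-120}"
    waited=0
    # Poll the guest state once per second until it is "shut off".
    while [ "$(virsh domstate "$guest" 2>/dev/null)" != "shut off" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $guest to shut off" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Example use (commented out, it needs a libvirt host):
#   virsh shutdown <yourmachine> && wait_for_shutoff <yourmachine>
```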

Then edit the machine definition and find the type in the machine attribute of the type tag.

   virsh edit <yourmachine>
   # in the editor, look for the line defining the type, e.g.
   <type arch='x86_64' machine='pc-i440fx-xenial'>hvm</type>

Change this to the value you want. If you need to check which types are available, you can list them via "-M ?"; the listing also shows what the current default is. Note that while upstream types are provided as a convenience, only Ubuntu types are supported. In general it is strongly recommended that you change to newer types if possible, to exploit newer features but also to benefit from bugfixes that only apply to the newer device virtualization.

   kvm -M ?
   # lists machine types, e.g.
   pc-i440fx-xenial       Ubuntu 16.04 PC (i440FX + PIIX, 1996) (default)
   ...
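
For those who prefer not to edit by hand, the change could be scripted roughly as below. This is a sketch under assumptions: default_machine_type and set_machine_type are hypothetical helpers, the listing is assumed to flag the default with "(default)" as in the example output, and the domain XML is assumed to use single-quoted attributes as in the type line shown earlier.

```shell
# Sketch only: both helpers are hypothetical and work purely on text from stdin.

# Print the name of the machine type flagged "(default)" in a "-M ?" listing.
default_machine_type() {
    awk '/\(default\)/ { print $1; exit }'
}

# Rewrite the machine attribute of the <type> element in a domain XML.
# Usage: set_machine_type NEW_TYPE < old.xml > new.xml
set_machine_type() {
    sed "s/\(<type[^>]*machine='\)[^']*'/\1$1'/"
}

# Example use (commented out, it needs a libvirt host and a shut-off guest):
#   virsh dumpxml --inactive <yourmachine> > guest-backup.xml
#   set_machine_type "$(kvm -M '?' | default_machine_type)" < guest-backup.xml > guest-new.xml
#   virsh define guest-new.xml
```

Keeping guest-backup.xml around also gives you the backed-up definition recommended above for a rollback.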

After this you can start your guest again. You can check the current machine type from guest and host depending on your needs.

   virsh start <yourmachine>
   # check from host, via dumping the active xml definition
   virsh dumpxml <yourmachine> | xmllint --xpath "string(//domain/os/type/@machine)" -
   # or from the guest via dmidecode
   sudo dmidecode | grep Product -A 1
           Product Name: Standard PC (i440FX + PIIX, 1996)
           Version: pc-i440fx-xenial

If you keep non-live definitions around, such as xml files, remember to update those as well.

Testing

A packager should use the tests from https://code.launchpad.net/~ubuntu-server/ubuntu/+source/qemu-migration-test/+git/qemu-migration-test to verify uploads. Regular runs of those tests against proposed will be added to https://jenkins.ubuntu.com/server/

QemuKVMMigration (last edited 2017-02-17 09:47:42 by paelzer)