-----
##(see the SpecSpec for an explanation)

 * '''Launchpad Entry''': UbuntuSpec:foundations-q-upstart-stateful-re-exec
 * '''Created''': 18 May 2012
 * '''Contributors''': JamesHunt
 * '''Packages affected''': upstart
 * '''Related blueprints''': UbuntuSpec:foundations-q-event-based-initramfs
 * '''Code''': [[https://code.launchpad.net/~upstart-devel/upstart/stateful-reexec|lp:~upstart-devel/upstart/stateful-reexec]] [[https://code.launchpad.net/~jamesodhunt/dbus/create-connection-from-fd|lp:~jamesodhunt/dbus/create-connection-from-fd]]

== Summary ==

This specification describes the plan to implement "stateful re-exec" support in Upstart.

Stateful re-exec refers to being able to restart Upstart ''and'' maintain the internal state across an {{{exec(2)}}} call. Upstart already has the ability to re-exec itself, but it loses all internal state when doing this since it is equivalent to simply restarting Upstart. This causes problems since after re-exec there may be processes running which Upstart ''had'' been managing, but post-exec it has no knowledge of them. As such, re-exec is only used on shutdown to minimise issues.

== Release Note ==

The "stateful re-exec" feature makes no changes to the externals of Upstart; it simply means that "`telinit u`" is now safe to use and will restart Upstart with no loss of state. The man page for `telinit(8)` has been updated to state this.

"`telinit u`" should be called when either Upstart itself or any of its dependent libraries (`libc`, `libnih` and `libjson`) are upgraded, to ensure that the running instance of Upstart is at the same version as the on-disk version, and that it is using the latest versions of all on-disk dependent libraries.

Note that Upstart now relies on (and is therefore linked to) `libjson.so` to handle serialisation and deserialisation of state.
== Rationale ==

By providing the ability to retain state, the following goals can be achieved:

 * Allow Upstart to be started in the initramfs and then be re-execed such that state is passed through to the main system. See: https://blueprints.launchpad.net/ubuntu/+spec/foundations-q-event-based-initramfs
 * Allow Upstart to be upgraded and not require a reboot to take advantage of new features.
 * Allow clean upgrades to libc6 (eglibc) and NIH (being two libraries Upstart relies upon). Currently, upgrading either of these libraries is problematic: because Upstart cannot perform stateful re-exec, it will still be holding open the original versions of these files, causing shutdown issues. For full details, see bug Bug:985755.

== Use Cases ==

 * Clio is a sysadmin. She wants to be able to upgrade any userland package on her 10,000 servers without having to schedule down-time.
 * Mnemosyne is an experienced Ubuntu user who lives her life on her laptop. She hardly ever reboots, preferring instead to suspend where possible. However, she'd still like to know she's running the latest and greatest version of all the (non-kernel) packages.
 * Morpheus is a very cautious server owner. He only runs LTS releases and wants to be assured that upgrading to a new LTS will be 100% reliable. He always reboots after an upgrade, but expects services he left running before the upgrade to continue to be running after upgrade (bug Bug:985755).

== Requirements ==

 * Ability to serialise Upstart's internal state.
 * Ability to deserialise previously serialised state back into Upstart.
 * Ability to upgrade from a version not supporting stateful re-exec to a version that does support it.
 * Ability to handle ''downgrade'' from a version supporting stateful re-exec to a version that does not.
 * Ability to upgrade from a version supporting stateful re-exec to a newer version that also supports it but whose serialisation format may have changed.
 * Ability to ''downgrade'' from a version supporting stateful re-exec to an older version supporting it but whose serialisation format may have changed.
 * Ability for any version of Upstart supporting stateful re-exec to be able to generate serialisation data at the current serialisation data format version.
 * Ability for any version of Upstart supporting stateful re-exec to be able to generate serialisation data at any previous serialisation data format version (to allow for stateful downgrades - re-exec scenario "{{{SSD-L}}}").
 * Ability to handle failure to read the existing state partially or fully.
 * Ability to handle failure to parse the existing state partially or fully.
 * The serialisation data should encode the version of Upstart.
 * The serialisation data should encode the version of the serialisation format.
 * The serialisation data should encode Session objects.
 * The serialisation data should encode Event objects.
 * The serialisation data should encode JobClass objects.
 * The serialisation data should encode Job objects.
 * Ability for all Jobs and associated processes running prior to the stateful re-exec to continue to be managed by Upstart after the stateful re-exec.
 * Ability to retain command-line settings across an exec. Currently, Upstart clears the command-line to "prettify" output for {{{ps(1)}}}. We should probably continue to do this, but ''also'' save {{{argv}}} to allow the re-exec to be run with the same options as when it was originally started (it would be confusing to boot with debug mode and have it revert to non-debug mode after a re-exec).

== Design ==

Upstart needs the ability to perform the following operations in order:

 1. Serialise its existing internal state.
 1. Re-exec itself.
 1. Read the serialised state and deserialise it back into its internal data structures.
 1. Continue operating as normal.
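The four operations above can be sketched in miniature. The following Python fragment is illustrative only (Upstart itself is written in C): it serialises a state dictionary to JSON, forks, has the child write the JSON into a pipe, and has the parent read it back. The {{{exec(2)}}} of the new {{{/sbin/init}}} is deliberately left out so that the sketch can run as an ordinary process. The function name {{{pass_state_through_pipe}}} and the shape of {{{state}}} are assumptions made for this example, not part of the design.

```python
import json
import os


def pass_state_through_pipe(state):
    """Child (the "old" process) writes JSON state; parent reads it back."""
    read_fd, write_fd = os.pipe()
    # Python creates pipe fds with close-on-exec set. A real re-exec needs the
    # read end to survive exec(2), so clear the flag to mirror the design.
    os.set_inheritable(read_fd, True)

    pid = os.fork()
    if pid == 0:
        # Child: close the reading end, write the state, then exit.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "w") as writer:
                json.dump(state, writer)
            status = 0
        finally:
            os._exit(status)

    # Parent: the real design would exec the new init here and read the
    # state from the inherited fd; this sketch reads it directly instead.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        restored = json.load(reader)
    os.waitpid(pid, 0)
    return restored
```

Calling {{{pass_state_through_pipe({"events": ["startup"]})}}} should return an equal dictionary in the parent. Because the parent reads while the child writes, state larger than the pipe buffer does not block the hand-over.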
Although the state passing will not be "total" (not every single internal data structure will or can be handled), after the deserialisation Upstart should have as near full knowledge of the state prior to the re-exec as possible.

== Implementation ==

=== State-Passing ===

Upstart will re-exec itself using the following process. Note that the ''parent'' becomes the new instance of Upstart, ''not'' the child (since Upstart must continue to be PID 1).

 1. Admin or maintainer script calls "{{{telinit u || :}}}" to request Upstart restart itself. This is the existing re-exec interface so Upstart will be changed to now also perform state-passing. Basic (non-stateful) re-exec support is currently available and used by {{{/etc/init.d/umountroot}}} to ensure Upstart doesn't hold "stale" links to old library versions which would cause shutdown to hang.
 1. The {{{SIGTERM}}} handler calls a new re-exec handling function. This ensures Upstart is not sitting in the main loop. Ensure all signals are blocked.
 1. Dispatch all D-Bus messages to ensure no {{{initctl}}} commands are being handled. Upstart is now effectively "paused" (no longer accepting D-Bus (and therefore {{{initctl}}}) commands, handling jobs or emitting events).
 1. Serialise all required internal state. If this fails, degrade to stateless re-exec.
 1. Create a pipe.
 1. Ensure pipe fds are '''NOT''' marked {{{O_CLOEXEC}}} such that they persist across an {{{exec(2)}}}.
 1. Parent clears {{{O_CLOEXEC}}} for all D-Bus file descriptors so they are NOT closed on exec.
 1. Parent clears {{{O_CLOEXEC}}} for all Log file descriptors so they are NOT closed on exec.
 1. Fork to create a child process.
 1. Parent closes writing end of pipe.
 1. Parent {{{exec(2)}}}s the new version of {{{/sbin/init}}} passing a magic flag ("{{{--state-fd }}}") which informs Upstart to read state from the specified file descriptor. The {{{--state-fd}}} option is new and distinct from the existing {{{--restart}}} flag which performs a "bare" re-exec.
    The child is now actually the "old" process (running the original version of Upstart), whereas the parent is now the "new" process (running the most-recently-installed version of Upstart, which ''may'' be older than the old!!).
 1. Parent blocks reading from file descriptor "{{{}}}".
 1. Child meanwhile closes reading end of the pipe.
 1. Child closes D-Bus control server connection to allow new parent to open it.
 1. Child closes D-Bus control bus connection to allow new parent to open it.
 1. Child writes state in JSON format (as generated by original parent) to writing end of pipe and exits. If it fails to complete the write in "some amount of time" (say 10 seconds?), this is deemed to indicate that the new parent doesn't support serialisation but the check performed above failed somehow, so it should log an error to the system log and exit. This scenario '''should''' be impossible, but handle it anyway.
 1. Parent reads serialisation data from reading end of pipe and reconstructs the internal objects (sessions, events, JobClasses, Jobs).
 1. Parent closes reading end of pipe.
 1. Parent sets the {{{O_CLOEXEC}}} flag for all deserialised D-Bus and Log objects such that they are not leaked to Jobs.
 1. Parent continues with normal initialisation.

=== Preparatory Tasks ===

Before attempting stateful re-exec, PID 1 needs to handle the following:

==== D-Bus ====

 * Complete servicing of all possible D-Bus client requests ({{{initctl}}}, etc) (by flushing the D-Bus queue for the control bus). Note that some existing D-Bus messages will linger (for example those associated with long-running {{{initctl emit foo}}}-type scenarios) and will therefore need to be encoded.
 * Obtain file descriptor for the control bus from D-Bus.
 * Close the control server (used by {{{initctl}}} for root only). This has to be done to stop any new requests and because the new Upstart will want to create it.
 * Clear the close-on-exec flag for the control bus file descriptor.
 * Serialise the control bus file descriptor.
 * Serialise all blocked messages ({{{Blocked->message}}}) along with their D-Bus serial number ({{{dbus_message_get_serial(blocked->message->message)}}}). This involves first marshalling the D-Bus message using {{{dbus_message_marshal()}}}.
 * Obtain the file descriptors for all existing D-Bus connections by iterating through {{{control_conns}}} and calling {{{dbus_connection_get_unix_fd()}}} on each connection prior to serialising them.

==== ptrace ====

All processes currently being ptraced need to be handled. The most reasonable approach would seem to be to wait for the application to reach the {{{started}}} state. However, that is dangerous. Imagine this scenario:

 1. User creates a new job and mis-specifies the {{{expect}}} stanza ("{{{expect daemon}}}" when the application doesn't even fork).
 1. {{{sudo apt-get dist-upgrade}}} pulls in new version of Upstart.
 1. PID 1 waits for the erroneous app to complete 2 forks.

The final step will never complete so the {{{apt-get dist-upgrade}}} will hang indefinitely. We could "timeout" after a few seconds of waiting maybe but that approach is ugly.

Since {{{ptrace(2)}}} '''IS''' retained across an {{{exec(2)}}} of the parent ("debugger") process, no special treatment is required: processes which were being ptraced prior to the re-exec will continue to trap and pass control to the re-exec'ed PID 1 after the re-exec.

=== Serialisation ===

[[http://tools.ietf.org/html/rfc4627|JSON]] will be used to represent the serialised state. Rationale:

 * JSON is simple.
 * JSON is standardised.
 * JSON is able to represent arrays and objects.
 * JSON is UTF-8 encoded and human-readable.
 * There are a number of available parsers.

==== Process ====

The minimum set of internal objects that are required to perform stateful re-exec are:

 * {{{Session}}}
 * {{{Event}}}
 * {{{JobClass}}}
 * {{{Job}}}
 * {{{Log}}}

===== Serialisation =====

 1. Create a "header" encoding meta data about the serialisation, including:
   * Upstart version
   * timestamp
   * serialisation data format version
 1. Serialise all {{{Session}}} objects.
 1. Serialise all {{{Event}}} objects including the {{{blocking}}} list.
 1. Iterate over all {{{JobClass}}} objects in {{{job_classes}}} and all {{{Job}}} instances referenced by each {{{JobClass}}} and serialise all {{{Job}}} objects "below" (as a child of) their associated parent {{{JobClass}}} in the JSON.
 1. For those jobs whose associated {{{JobClass.console}}} is {{{CONSOLE_LOG}}}, serialise a {{{Log}}} object.

Notes:

 * Even though {{{Event}}} objects reference {{{Job}}} objects and ''vice versa'', the serialisation can be handled in a single pass: as-yet-unserialised entities can be safely referred to by name or index value since we know we will eventually serialise all entities.
 * If Upstart detects that the user is down-grading and it sees syntax it doesn't understand, a flag will be added to the {{{ConfSource}}} stating that the job cannot be restarted (although it can still be stopped).
 * {{{Log}}} may pose some problems since it includes an {{{NihIo}}} and an {{{NihIoBuffer}}}. The {{{NihIoBuffer}}} is relatively easy to serialise, but the {{{NihIo}}}, as well as including an {{{NihIoBuffer}}}, also includes an {{{NihIoWatch}}}.

===== Deserialisation =====

 1. Read the serialisation data ensuring the "header" can be understood.
 1. Deserialise the {{{Session}}} objects.
 1. Deserialise the {{{Event}}} objects.
 1. Deserialise the {{{JobClass}}} objects.
 1. Deserialise the {{{Job}}} objects associated with each {{{JobClass}}} object.
 1. Create a ConfSource with a "special" path ("serialized_conf_source" or similar) that is not backed by any actual file. If a job that is already running from the initramfs is stopped and started, at that point you get the correct ''new'' {{{/etc/init/job.conf}}} from the main system.
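As an illustration of the header and the {{{JobClass}}}/{{{Job}}} nesting described in the steps above, here is a minimal Python sketch (Upstart itself is written in C and uses a JSON library). Every key name used here ({{{header}}}, {{{upstart_version}}}, {{{format_version}}}, {{{job_classes}}} and so on) and the {{{FORMAT_VERSION}}} constant are hypothetical; the real schema is not defined by this specification.

```python
import json
import time

FORMAT_VERSION = 1  # hypothetical serialisation data format version


def serialise(upstart_version, sessions, events, job_classes):
    """Build a JSON document: header first, Jobs nested under their JobClass."""
    return json.dumps({
        "header": {
            "upstart_version": upstart_version,
            "timestamp": int(time.time()),
            "format_version": FORMAT_VERSION,
        },
        "sessions": sessions,
        "events": events,
        # Each JobClass carries its own Job instances as children.
        "job_classes": [
            {"name": name, "jobs": jobs} for name, jobs in job_classes.items()
        ],
    })


def deserialise(text):
    """Check the header, then return sessions, events and job classes."""
    data = json.loads(text)
    version = data["header"]["format_version"]
    if version > FORMAT_VERSION:
        # Unknown (newer) format: a caller could fall back to stateless re-exec.
        raise ValueError("unsupported serialisation format version %d" % version)
    job_classes = {jc["name"]: jc["jobs"] for jc in data["job_classes"]}
    return data["sessions"], data["events"], job_classes
```

Checking the header before touching anything else means a format this reader cannot understand is rejected before any objects are partially reconstructed.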

Notes:

 * Since an {{{Event}}} can block a {{{Job}}} and a {{{Job}}} has pointers to one or more {{{Event}}} objects, the deserialisation cannot be performed in a single pass. To resolve this circular reference, {{{Event}}} objects are first deserialised '''without''' their {{{blocking}}} list. Next, {{{JobClass}}} and {{{Job}}} objects are deserialised, again '''without''' any {{{blocking}}} lists associated with {{{Job}}}s. Once all objects are (partially) deserialised, a second pass is made where all {{{blocking}}} lists are "fixed up" ({{{blocked_new()}}} is called). This now works since {{{blocked_new()}}} has at least a skeletal object ({{{Job}}}, {{{Event}}}, ''et cetera'') to reference. After the second pass, all objects are able to correctly reference one another.

==== Data Representation ====

===== Schema =====

The format of the JSON should resemble:

||<#FF0000> FIXME ||

===== Example =====

See http://people.canonical.com/~jhunt/upstart/stateful-reexec/state.json

== Re-Exec Scenarios ==

Table showing possible re-exec scenarios.

||'''Scenario'''||'''Importance'''||'''Old Version'''||'''New Version'''||'''Description'''||'''Re-Exec Strategy'''||'''Notes'''||
||{{{NNU}}} || high || not supported || not supported || non-stateful upgrade || stateless || Behaviour today ||
||{{{NSU}}} || high || not supported || supported @ s-version 'x' || non-stateful to stateful upgrade || stateless (1) || Upgrading to first version of Upstart supporting stateful re-exec ||
||{{{NSU-X}}} || low || non-Upstart init, not supported || supported @ s-version 'x' || non-stateful to stateful upgrade || stateless (1) || Non-Upstart init daemon upgrading to a version of Upstart supporting stateful re-exec ||
||{{{SSU-E}}} || high || supported @ s-version 'x' || supported @ s-version 'x' || stateful to stateful upgrade || stateful || Expected common case (2) - versions equal. ||
||{{{SND}}} || medium || supported @ s-version 'x' || not supported || stateful to non-stateful downgrade || stateless || ||
||{{{SSU-G}}} || high || supported @ s-version 'x' || supported @ s-version 'y' || stateful to newer stateful version upgrade || stateful || moving to newer (greater than) version ||
||{{{SSD-L}}} || medium || supported @ s-version 'y' || supported @ s-version 'x' || stateful to older stateful version downgrade || stateful (3) || moving to old (less than) version ||

Key and Notes:

 * The importance column refers to the relative priority of particular scenarios: "high" '''must''' be supported, "medium" '''may''' be supported (this cycle).
 * Terminology
   * "supported" refers to re-exec support being available.
   * "s-version" refers to the serialisation data format version, not the version of Upstart itself.
   * "stateful" refers to stateful re-exec.
   * "stateless" refers to a "bare" re-exec with no state passing.
 * {{{NSU-X}}} is essentially the same scenario as {{{NSU}}} but is listed for completeness. This scenario may become relevant to [[http://www.debian.org|Debian]] very soon. Ubuntu handled this scenario when Upstart was first introduced by '''not''' re-exec'ing `/sbin/init` after Upstart was first installed but having Upstart be the init daemon post-reboot.
 * Footnotes
   * (1) - non-stateful version of Upstart is unaware of newer versions' stateful re-exec abilities.
   * (2) - the serialisation data format version should change as little as possible.
   * (3) - Newer versions of Upstart must support all previous serialisation versions.

== Risks ==

=== System Testing ===

Currently, there is no facility that would allow re-exec scenarios to be exercised against multiple {{{/sbin/init}}} binaries.
That is to say, although the scenarios can be tested indirectly by generating the appropriate JSON and piping it into {{{/sbin/init}}}, there is no way to '''automatically''' test the real scenario where, for example, version 'X' of Upstart is upgraded to version 'Y'.

=== 3rd Party Library ===

Making use of a 3rd-party library for JSON parsing is a risk in that it won't be using the [[https://launchpad.net/libnih|NIH Utility Library]] and so won't have all the benefits associated with using it. To mitigate this risk, the chosen library will be audited and improved where necessary before being used. Additionally, the set of tests for this feature will be extremely large and must cover all possible failure scenarios.

=== Unrepresentable State ===

This design requires any changes to internal data structures to be accompanied by:

 * Appropriate updates to the serialisation and deserialisation code.
 * Additional stateful re-exec tests to cover the changed internals.

There are 2 problems with this:

 * It's a manual process, so great care needs to be taken to ensure it happens! :) This could be mitigated by somehow auto-generating the serialisation/deserialisation code, but that would be very complex. A cheap compromise would be to ensure that every change to Upstart forced a complete set of '''system''' tests to run that exercise all possible stateful re-exec scenarios to at least ensure that stateful re-exec will not break.
 * It makes any future changes to Upstart internals potentially very costly in terms of time and could slow down development speed as a result. Aside from automation, it is unclear how to further mitigate this issue. It is worth noting that Scott included (partial?) stateful re-exec support in early revisions of Upstart (versions {{{0.2.0}}} to {{{0.3.2}}} inclusive), but eventually dropped this feature due to the maintenance cost.
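
One inexpensive guard against the first problem is a test that fails whenever a structure gains a field that the serialiser does not emit. The sketch below is an assumption of this write-up rather than part of the design, and uses Python dataclasses purely for illustration (Upstart is written in C, where an equivalent check would have to be built differently). {{{Job}}}, its three fields, {{{serialise_job}}} and {{{unserialised_fields}}} are hypothetical stand-ins.

```python
from dataclasses import dataclass, fields


@dataclass
class Job:
    # Hypothetical, heavily reduced stand-in for Upstart's Job structure.
    name: str
    goal: str
    pid: int


def serialise_job(job):
    """Hand-written serialiser that must be kept in step with Job by hand."""
    return {"name": job.name, "goal": job.goal, "pid": job.pid}


def unserialised_fields(cls, serialised):
    """Return the names of fields of `cls` that are missing from `serialised`."""
    return sorted(f.name for f in fields(cls) if f.name not in serialised)
```

A test asserting that {{{unserialised_fields(Job, serialise_job(job))}}} is empty would start failing as soon as a field is added to {{{Job}}} without a matching serialiser change. It only checks that a field is present, not that it round-trips correctly, so it complements the system tests rather than replacing them.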

=== D-Bus handling ===

D-Bus marks all sockets as {{{O_CLOEXEC}}} and it does not appear there is an easy way to determine the fds D-Bus has in use and clear that bit.

== Testing ==

This is a large feature which will require extremely careful unit, functional, and system testing. All failure scenarios should be included in the tests (where possible).

 * Ensure that a stopped system job can be serialised, deserialised and started.
 * Ensure that a system job blocked on an event can be serialised, deserialised and started.
 * Ensure that a user job blocked on an event can be serialised, deserialised and started.
 * Ensure that an event blocked on a system job can be serialised, deserialised and started.
 * Ensure that an event blocked on a user job can be serialised, deserialised and started.
 * Ensure that a stopped system job in a chroot can be serialised, deserialised and started.
 * Ensure that a stopped user job can be serialised, deserialised and started.

{{{
# Generate the full matrix of "running process" test cases.
for proc in pre-start main post-start pre-stop post-stop
do
  for type in system user
  do
    for action in "allowed to continue" stopped restarted
    do
      echo "Ensure that a $type job running the $proc process can be serialised, deserialised and $action."
    done
  done
done
}}}

 * Ensure that a task which dies ''after'' serialisation but ''before'' deserialisation is handled.
 * Ensure that a respawn service which dies ''after'' serialisation but ''before'' deserialisation is handled.
 * Ensure that a respawn service which forks once ''after'' serialisation but ''before'' deserialisation is handled.
 * Ensure that a respawn service which forks twice ''after'' serialisation but ''before'' deserialisation is handled.

== Impossible Scenarios ==

The following is a list of scenarios that cannot be handled directly:

 * Upstart is downgraded to a version that requires an older libnih, but libnih itself is not downgraded.
 * libnih's ABI changes so it is upgraded.
   Upstart is then re-exec'ed but either this happens before an updated Upstart package is installed or no new Upstart package is available to install.

These scenarios must be handled via packaging policy.

== Unresolved Issues ==

 * If Upstart is downgraded to a version which supports stateful re-exec but whose serialisation data format cannot represent the current state, should Upstart refuse to perform stateful re-exec or simply "do the best it can"? Ideally, any newer version of Upstart will support every previous serialisation data format version such that this scenario can be handled correctly. However:
   * Downgrading like this should not be a common operation (so should we spend the effort doing this?)
   * Ideally we need a way to query the serialisation data format version prior to attempting the re-exec. We could add a {{{--serialisation-version}}} flag and have Upstart fork, run "{{{/sbin/init --serialisation-version}}}" first and, if it detects that it cannot represent the state for the new version, refuse to retain state on re-exec.

== Limitations ==

We are extremely keen to involve the community from an early stage to aid in testing and to allow them to provide feedback on this useful feature. However, since the code is still in development, this initial "preview" of the stateful re-exec feature will have a number of limitations:

 * Not yet possible to pass D-Bus connections across the re-exec (requires {{{dbus_connection_open_from_fd()}}} from lp:~jamesodhunt/dbus/create-connection-from-fd).
   '''DETAILS:''' All D-Bus clients, including the Upstart bridges, will be forcibly disconnected from Upstart. This means that, for example, any new udev events resulting from plugging hardware post-boot will not be propagated back to Upstart for the duration between the bridge stopping and (the new) Upstart respawning it.
   '''IMPACT:''' Medium/High.
   '''OUTCOME:''' Must be fixed.
 * Downgrading of Upstart (to a version that does not support stateful re-exec) not yet handled fully.
   '''IMPACT:''' An impotent process will be left post re-exec that will linger until either killed or the system is shut down.
   '''OUTCOME:''' Should be fixed.
 * Upstart cannot yet work in the initramfs reliably.
   '''DETAILS:''' More precisely, if a job existed in the initramfs with the same name as a job in the root filesystem context, the re-exec'ed Upstart in the root filesystem would not have a correct view of its version of the job until that job's configuration file changed after the re-exec.
   '''IMPACT:''' Low - Ubuntu does not yet use Upstart in the initramfs. In addition, Upstart can now operate with '''no''' initramfs in the common case.
   '''OUTCOME:''' Should be fixed.

== Additional Information ==

A debug facility should be added to Upstart where it exposes a D-Bus method that allows the current state to be serialised when called. This can then be used as work progresses to ensure expected results.

----
CategorySpec