ServerKarmicCloudPowerManagement

Revision 12 as of 2009-06-18 17:28:10


Summary

In Jaunty, we proved that at least some Ubuntu servers can successfully suspend and hibernate, and can resume from remote triggers such as Wake-on-LAN. In Karmic, we plan to use this technology to make Ubuntu Enterprise Clouds more energy efficient than other cloud solutions by ensuring that only the minimum number of cloud nodes is powered up at any given time.

This specification divides Cloud Power Management into 3 distinct parts:

  1. Putting nodes to sleep (powernap)
  2. Waking nodes up (powerwake)
  3. Compressing the cloud (eucalyptus)
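The first part, putting a node to sleep once it has been idle long enough, can be illustrated with a small sketch. This is not powernap's actual code: the class name `IdleMonitor`, the `watched` process list, and the `grace_seconds` parameter are hypothetical names chosen for illustration, and the caller is assumed to supply the list of currently running process names.

```python
import time


class IdleMonitor:
    """Decide when a node has been idle long enough to be put to sleep."""

    def __init__(self, watched, grace_seconds, clock=time.monotonic):
        self.watched = set(watched)        # process names that keep the node awake
        self.grace_seconds = grace_seconds
        self.clock = clock                 # injectable so the logic can be tested
        self.idle_since = None             # time the node was first seen idle

    def should_sleep(self, running):
        """Return True once no watched process has run for grace_seconds."""
        now = self.clock()
        if self.watched & set(running):
            self.idle_since = None         # activity seen: reset the idle timer
            return False
        if self.idle_since is None:
            self.idle_since = now
        return now - self.idle_since >= self.grace_seconds
```

A daemon built around this would poll the process table at some interval and, when `should_sleep` returns True, run whatever action the administrator configured (suspend, hibernate, or power off).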

Release Note

Ubuntu 9.10 Enterprise Cloud deployments can leverage dynamic power management facilities. Unused nodes are suspended (or powered down), and are resumed (or powered on) only when necessary, yielding a more energy efficient data center.

Rationale

Cloud Computing offers versatile, scalable resources in the form of virtual machines scattered across server hardware. The cloud controller can handle starting, stopping, and migrating these virtual machines on the real hardware. In the interest of energy efficiency, a minimum amount of server hardware should be powered on and running in an Ubuntu Enterprise Cloud.

User stories

  1. Kim is the administrator of a cloud which is often underutilized. She's also tasked with making her data center as energy efficient as possible. Ideally, each server in her entire cloud would run in a low-power state (suspended, hibernated, or powered off), and only wake when necessary.
  2. Donna is the administrator of a cloud that she would like to periodically defragment. She would like to rearrange the virtual machines running in the cloud, either manually or automatically, so that they occupy a minimum amount of hardware. After the migration, unused hardware should be put into a lower-power standby state.

Assumptions

  1. The suspend/hibernate/resume functionality should only be enabled on server hardware that supports such power management features. Power-on/power-off will need to be used otherwise. We need a mechanism for detecting the support, or lack thereof.
  2. A mechanism for remotely waking systems is necessary, and may require support in the BIOS or auxiliary hardware. Wake-on-LAN is the simplest remote-wake mechanism, so we will focus on it in the short term.
  3. KVM live migration needs to be reliable and highly performant for the defragmentation feature. This necessitates persistent, shared storage, such as iSCSI or a clustered filesystem.
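To make the Wake-on-LAN assumption concrete, here is a minimal sketch of how a magic packet is built and broadcast. The packet layout (six 0xFF bytes followed by the target MAC address repeated sixteen times) is the standard one; the function names, the default broadcast address, and the use of UDP port 9 are illustrative assumptions, not part of this specification.

```python
import socket


def build_magic_packet(mac):
    """Return the 102-byte Wake-on-LAN payload for a MAC like 00:11:22:33:44:55."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("expected a 48-bit MAC address, got %r" % mac)
    mac_bytes = bytes.fromhex(hex_digits)   # raises ValueError on non-hex input
    # Six 0xFF bytes followed by the target MAC repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (port 9 is a common convention)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
    finally:
        sock.close()
```

The target's network interface and BIOS must have Wake-on-LAN enabled for the packet to have any effect, which is why hardware support needs to be detected or tested beforehand.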

Design

In Karmic, we hope to integrate into existing cloud management frameworks:

  1. A configurable utility installed on cloud nodes (or any Ubuntu server, really) that takes a prescribed action (such as lowering the power state) when certain conditions are met (such as not running some specific processes for some number of seconds)

    • powernap - in Karmic universe as of 2009-06-15

  2. A utility installed on a cloud controller (or any router/server, really) that can wake sleeping systems according to a method configured for the target system (wakeonlan, IPMI, NUT, etc.)
    • should be another binary package (powerwake?) under the powernap source package
    • Eucalyptus (and others) will need to be modified to use powerwake when a target system appears offline
  3. A cloud controller with a node selection algorithm ensuring that new virtual machines are deployed onto running systems with spare capacity first, and that additional systems are powered up only when the need arises
    • Eucalyptus already has an algorithm that partly does this, "greedy scheduling"; it simply needs to be enhanced to give priority to running systems
  4. Automatic algorithms that can perform these operations in an unattended manner when certain configurable thresholds or conditions are met
    • This compression/defragmentation/refactoring needs to be added to Eucalyptus
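The node selection rule in item 3 can be sketched as follows. This is not Eucalyptus's greedy scheduler: the `Node` tuple, its `free_slots` field, and the `pick_node` function are hypothetical, and the tie-breaking choice of preferring the fullest running node (to keep load packed) is an assumption added for illustration.

```python
from collections import namedtuple

# Hypothetical view of a cloud node: name, power state, and free VM slots.
Node = namedtuple("Node", "name running free_slots")


def pick_node(nodes):
    """Return the node that should host the next VM, or None if the cloud is full."""
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        return None
    # Running nodes sort first (False < True). Among equals, prefer the node
    # with the fewest free slots so that load stays packed on few machines.
    return min(candidates, key=lambda n: (not n.running, n.free_slots))
```

If the chosen node is not running, the controller would first have to wake it (for example through powerwake) before deploying the virtual machine.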

Implementation

This section should describe a plan of action (the "how") to implement the changes discussed. Could include subsections like:

Test/Demo Plan

Attach a watt-meter in series with the power supply for some set of cloud hardware. Ensure that no virtual machines are running in the cloud. Observe the unused nodes entering lower power states, as configured, and record the minimum wattage drawn by the cloud. Request new VMs from the cloud controller. Observe nodes waking and serving the newly requested virtual machines, and the increase in power drawn. Destroy the virtual machines, and observe nodes powering down and power consumption decreasing.

Unresolved issues

  • Persistent shared storage for live migration

UDS Raw Notes

  • Machines migrate when there is light load.
  • For Eucalyptus
    • Boot nodes into suspended state.
    • Wake up as needed.
  • Use live migration to compress a cloud into the minimum amount of hardware.
    • Suspend unneeded hardware.
  • pm-suspend and pm-hibernate are now in server seed.
  • Re-establishing network connectivity after resume may be an issue.
  • Manual process of compressing the cloud hardware.
    • SSH to suspend and wakeonlan to resume.
  • With new improvements to boot speed, actually powering off may be useful.
  • Hibernate may be preferable when reloading a cache.
  • Support poweroff, hibernate and suspend depending on the admin preference.
  • Support IPMI as well as wakeonlan.
  • Make framework configurable as to what tools to use.
  • Do a wakeonlan testing day to generate a list of working hardware.
  • Wakeonlan
    • Simple
    • Doesn't scale
  • IPMI
    • Server hardware required.
    • Authentication supported
  • NUT
    • Network UPS Tools
    • Can also wake systems.
  • Use Cases:
    • Manually compress cloud hardware.
    • Process to actively monitor cloud load and adjust hardware according to load.
  • Use libvirt for VM migration.
  • Eucalyptus integration
    • Power Management needs to live in the same space as the service level agreement.
  • Integrate power management into Landscape.
    • Set SLAs in Landscape which then uses the plugins for power management.
  • Use application to monitor power consumption.
  • May have to target specific hardware.
  • Strongly consider adding power management hooks to libvirt.
    • Could automatically wake up a machine when connecting.
  • What is the appropriate amount of idle time before suspending a system?
  • Allow configuration of the number of hot nodes.
  • Adjust cloud load based on node load.
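The notes on compressing a cloud onto the minimum amount of hardware amount to a bin-packing problem. The sketch below uses the first-fit decreasing heuristic, which is a technique chosen here for illustration and is not named in the notes. It assumes that all nodes have the same capacity, that VM sizes are expressed in abstract "slots", and that any VM can be live-migrated to any node; `plan_compression` and its parameters are hypothetical names.

```python
def plan_compression(vm_sizes, node_capacity):
    """Pack VMs onto as few equal-sized nodes as first-fit decreasing finds.

    vm_sizes maps a VM name to the slots it needs. The result is a list of
    VM-name lists, one per node that must stay awake.
    """
    bins = []  # each entry is [free_slots, [vm names]]
    # Place the largest VMs first; break ties by name for a stable plan.
    for name, size in sorted(vm_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        if size > node_capacity:
            raise ValueError("VM %r does not fit on any node" % name)
        for b in bins:
            if b[0] >= size:           # first node with enough room
                b[0] -= size
                b[1].append(name)
                break
        else:
            bins.append([node_capacity - size, [name]])
    return [names for _, names in bins]
```

Nodes that receive no virtual machines in the resulting plan are the ones that could be suspended. A real implementation would also have to weigh the cost of each migration, which this sketch ignores.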


CategorySpec