MonitoringAndAlerting

Summary

Release Note

Rationale

User stories

As a UEC admin I'm alerted by UEC when physical nodes go down or services are flaky.

As a UEC admin I can integrate my UEC deployment into my existing nagios system.

Assumptions

Design

Write the code to measure once - make it available to multiple readers.

Reader examples: collectd, munin, snmp, reconnoiter, nagios.
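
A minimal sketch of the "measure once, many readers" idea, under stated assumptions: one collector writes a sample to a shared file, and small adapters render that same sample for different readers. The function names, the JSON file layout, the metric names and the threshold are all hypothetical and are not part of UEC or Eucalyptus. The munin adapter emits the usual "field.value N" lines and the nagios adapter follows the plugin convention of an exit code (0 for OK, 2 for CRITICAL) plus a status line.

```python
import json
import os
import tempfile
import time


def measure():
    """Collect the metrics once. The values are placeholders (assumption);
    a real probe would query the cloud controller or the hypervisor."""
    return {"timestamp": int(time.time()), "vms_running": 3, "disk_free_mb": 20480}


def publish(sample, path):
    """Write the sample atomically so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(sample, handle)
    os.replace(tmp_path, path)


def load(path):
    """Read back the most recently published sample."""
    with open(path) as handle:
        return json.load(handle)


def as_munin(sample):
    """Render the sample as munin plugin values: '<field>.value <number>'."""
    return "\n".join(
        "%s.value %s" % (name, sample[name])
        for name in sorted(sample)
        if name != "timestamp"
    )


def as_nagios(sample, min_disk_free_mb=1024):
    """Return (exit_code, status_line) in the nagios plugin convention.
    The threshold is a hypothetical default."""
    free = sample["disk_free_mb"]
    state, code = ("OK", 0) if free >= min_disk_free_mb else ("CRITICAL", 2)
    return code, "%s - disk_free_mb=%d | disk_free_mb=%d" % (state, free, free)
```

A cron job could call publish(measure(), path) while the munin plugin and the nagios check only call load(path), so adding another reader (collectd, snmp) means adding one more adapter rather than one more measurement.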

Implementation

Test/Demo Plan

Unresolved issues

BoF agenda and discussion

UDS Natty discussion

Follow-up to https://blueprints.launchpad.net/ubuntu/+spec/server-maverick-uec-monitoring.

User stories:
As a UEC admin I'm alerted by UEC when physical nodes go down or services are flaky.
As a UEC admin I can integrate my UEC deployment into my existing nagios system.

UEC Monitoring uses a munin plugin based on the ganglia plugin packaged with Eucalyptus. If munin is installed, the plugin mostly configures itself.

OpenStack has its own instrumentation that collects information on the running virtual machines and stores it.

What to measure on a UEC deployment?
 Now:
 * number of vms
 * memory 
 * disk space
 * cpu time used by instance
 * memory consumed by instance
 * disk stats by instance
 These are measured by a script in eucalyptus.

 Next:
 * Node controller:
   * number of instances running
   * resources used by each instance: number of cores, disk available, memory
   * generic stats: network io, disk io, power consumption
   * statistics about each instance: kvm information, cpu load
   * ksm
   * disk io per instance
 * Storage controller:
   * disk io
   * network io
 * instance creation.
 * is your cloud full?
   capacity of the cloud: can more instances be spawned?
   all the resources that a user can create/request.
 * power utilization?
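
The "is your cloud full?" question can be sketched as a capacity calculation. This is only an illustration under assumptions: the Resources type, its fields and the per-node free figures are hypothetical, and a real probe would take them from the node controllers. The idea is that a node can host as many more instances of a given type as its scarcest free resource allows, and the cloud is full for that type when no node can host one.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Free resources on a node, or the size of one instance type (hypothetical)."""
    cores: int
    memory_mb: int
    disk_gb: int


def instances_that_fit(free, vm_type):
    """How many more instances of vm_type fit on one node.
    The scarcest resource is the limit."""
    if min(vm_type.cores, vm_type.memory_mb, vm_type.disk_gb) <= 0:
        raise ValueError("instance type must request a positive amount of each resource")
    return min(
        free.cores // vm_type.cores,
        free.memory_mb // vm_type.memory_mb,
        free.disk_gb // vm_type.disk_gb,
    )


def cloud_capacity(nodes, vm_type):
    """Total number of additional instances of vm_type across all nodes."""
    return sum(instances_that_fit(node, vm_type) for node in nodes)


def is_cloud_full(nodes, vm_type):
    """True when no node can host one more instance of vm_type."""
    return cloud_capacity(nodes, vm_type) == 0
```

Capacity is summed per node rather than over pooled totals because an instance cannot be split across nodes: two nodes with one free core each cannot host a two-core instance.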

Which framework to write probes for?
 * nagios:
   - already in main.
   - what to alert on?
 * collectd:
   - MIR status:
     splitting the package into two source packages.
   - supports multiple outputs.
   - has a lot of dependencies. Performs well but tightly coupled with instrumentation.
   - doesn't support graphing.
 * munin:
   - already in main
   - already used in UEC monitoring framework
 * snmp:
   - already in main
 * ganglia:
   - in universe
   - writes only to rrd files.
 * zenoss:
   - not packaged
   - pull-based system, i.e. agentless.
   - upstream willing to help.

How to present the graph to sysadmin?
 * munin
 * collectd
 * snmp


Configuration:
 * unicast - point-to-point 
 
Alerting:
 * nagios may go away: there is a nagios fork, icinga (mainly changes around the web ui?). www.icinga.org
 * shinken: complete rewrite in python.
 * flapjack:
 
Actions:
 * move collectd to main.
 * should munin go to universe (probably not yet)
 * find a graphing solution (munin, graphite, reconnoiter (omniti, not packaged), visage).


CategorySpec

ServerTeam/Specs/Natty/MonitoringAndAlerting (last edited 2010-11-05 02:38:30 by dsl-173-206-78-27)