MonitoringAndAlerting
Launchpad Entry: cloud-server-n-monitoring-alerting
Created:
Contributors:
Packages affected:
Summary
Release Note
Rationale
User stories
As a UEC admin I'm alerted by UEC when physical nodes go down or services are flaky.
As a UEC admin I can integrate my UEC deployment into my existing nagios system.
Assumptions
Design
Write the code to measure once - make it available to multiple readers.
Reader examples: collectd, munin, snmp, reconnoiter, nagios.
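One way to read the "measure once, multiple readers" idea is sketched below: a single probe writes its sample to a shared file, and each reader only reformats that sample. This is a minimal illustration, not the spec's implementation. The function names, the JSON file, the metric names and their values are all hypothetical; only the munin "field.value N" output form and the Nagios plugin exit codes (0 = OK, 2 = CRITICAL) are established conventions.

```python
import json
import os
import tempfile
import time


def measure():
    """Take one measurement pass (hypothetical placeholder values).

    A real probe would query the hypervisor or Eucalyptus here.
    """
    return {"instances_running": 3, "memory_used_mb": 2048, "disk_free_gb": 120}


def publish(sample, path):
    """Write the sample atomically so readers never see a partial file."""
    record = dict(sample, timestamp=int(time.time()))
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
    os.replace(tmp_path, path)


def load(path):
    """Read the most recently published sample."""
    with open(path) as handle:
        return json.load(handle)


def munin_values(record):
    """Reader 1: format metrics as the 'field.value N' lines a munin plugin prints."""
    return "\n".join(
        f"{name}.value {value}"
        for name, value in sorted(record.items())
        if name != "timestamp"
    )


def nagios_status(record, min_disk_free_gb=50):
    """Reader 2: return (exit_code, message) using Nagios plugin exit codes."""
    free = record["disk_free_gb"]
    if free < min_disk_free_gb:
        return 2, f"CRITICAL - {free} GB disk free"
    return 0, f"OK - {free} GB disk free"
```

With this split, adding a collectd or snmp reader would mean adding another formatter over the same record, without touching the measuring code.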
Implementation
Test/Demo Plan
Unresolved issues
BoF agenda and discussion
UDS Natty discussion
Follow-up of https://blueprints.launchpad.net/ubuntu/+spec/server-maverick-uec-monitoring.
User stories:
* As a UEC admin I'm alerted by UEC when physical nodes go down or services are flaky.
* As a UEC admin I can integrate my UEC deployment into my existing nagios system.
UEC Monitoring uses a munin plugin based on the ganglia plugin packaged with Eucalyptus. If munin is installed, it mostly configures itself. OpenStack has its own instrumentation that collects and stores information on the running virtual machines.
What to measure on a UEC deployment?
Now:
* number of VMs
* memory
* disk space
* CPU time used by instance
* memory consumed by instance
* disk stats by instance, measured by a script in Eucalyptus
Next:
* Node controller:
  * number of instances running
  * resources used by each instance: number of cores, disk available, memory
  * generic stats: network I/O, disk I/O, power consumption
  * statistics about each instance: KVM information, CPU load
  * KSM
  * disk I/O per instance
* Storage controller:
  * disk I/O
  * network I/O
* instance creation
* is your cloud full? Capacity of the cloud: can more instances be spawned? This covers all the resources that a user can create or request.
* power utilization?
Which framework to write probes for?
* nagios:
  - already in main.
  - what to alert on?
* collectd:
  - MIR status: splitting the package into two source packages.
  - supports multiple outputs.
  - has a lot of dependencies. Performs well but is tightly coupled with instrumentation.
  - doesn't support graphing.
* munin:
  - already in main.
  - already used in the UEC monitoring framework.
* snmp:
  - already in main.
* ganglia:
  - in universe.
  - writes only to RRD files.
* zenoss:
  - not packaged.
  - pull-based system, i.e. agentless.
  - upstream willing to help.
How to present the graphs to the sysadmin?
* munin
* collectd
* snmp
Configuration:
* unicast - point-to-point
Alerting:
* nagios may go away: there is a nagios fork, icinga (mainly changes around the web UI)? www.icinga.org
* shinken: complete rewrite in Python.
* flapjack
Actions:
* move collectd to main.
* should munin go to universe? (probably not yet)
* find a graphing solution (munin, graphite, reconnoiter (OmniTI, not packaged), visage).
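The "is your cloud full?" question from the discussion could be answered by a probe along the following lines. This is only a sketch under stated assumptions. The `Resources` type, the function names, and the warning threshold are hypothetical, and the free resources per node are assumed to have been collected already. The return codes follow the Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical resource triple: free capacity of a node, or size of an instance type."""
    cores: int
    memory_mb: int
    disk_gb: int


def instances_that_fit(free, needed):
    """How many instances of size `needed` fit into `free` on one node.

    Assumes every field of `needed` is positive.
    """
    return min(
        free.cores // needed.cores,
        free.memory_mb // needed.memory_mb,
        free.disk_gb // needed.disk_gb,
    )


def cloud_headroom(nodes, needed):
    """Total number of additional instances the cloud could still start."""
    return sum(instances_that_fit(node, needed) for node in nodes)


def capacity_check(nodes, needed, warn_below=5):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    headroom = cloud_headroom(nodes, needed)
    if headroom == 0:
        return 2, "CRITICAL - cloud is full"
    if headroom < warn_below:
        return 1, f"WARNING - room for {headroom} more instances"
    return 0, f"OK - room for {headroom} more instances"
```

Headroom is counted per node rather than from cloud-wide totals, because an instance has to fit on a single node controller. Free memory spread across several nodes does not help if no single node can host the instance.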
ServerTeam/Specs/Natty/MonitoringAndAlerting (last edited 2010-11-05 02:38:30 by dsl-173-206-78-27)