System Garden

Habitat 1.0 User Manual


System Performance

The data gathered by clockwork's built-in probes is designed to aid capacity planning and problem resolution. Usually, this relies on the trend in consumption of the major resources: processor, memory, storage and network. However, other, minor indicators can also be used; these are highly system specific but can be very important.

In certain situations, log monitoring and event recording are also important, especially for showing when events coincide with changes in the data.

Indicators

The main indicators of capacity usage (processor, memory, storage and networking) are handled in a very similar way across Unix-like systems, although memory management varies more than the other resources. On all systems, these measurements can be represented in a similar way and are the primary indicators of the capacity of a given machine. However, they should always be read in conjunction with the system specific indicators.

System

The processor and memory statistics are collected by the same probe and appended into a ring named sys, which appears as the leaf node system in the choice tree.

A full list of the data collected (at the time of writing) is held in an appendix; a few notable indicators are described below.

Habitat's sys probe calculates a %work value (0-100 range), which indicates how much time is spent processing in all categories. In Linux this is %user+%system+%nice (user time, kernel time and low priority user time). In Solaris, this is just %user+%system.

%idle in Linux 2.4 shows what proportion of time is left over after processing has been completed; in Solaris and Linux 2.6, one needs to add %idle+%wait together to get the same figure. %wait stands for wait I/O, the amount of time that processes wait on blocked I/O.
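The arithmetic behind %work and the left-over time can be sketched in a few lines of Python. This is only an illustration of the sums described above, not the sys probe's own code; the function names and the platform labels are assumptions made for the example.

```python
def work_percent(user, system, nice=0.0, platform="linux"):
    """Return a %work-style figure (0-100) from CPU time percentages."""
    if platform == "linux":
        total = user + system + nice   # user + kernel + low priority user time
    elif platform == "solaris":
        total = user + system          # no separate nice figure is added
    else:
        raise ValueError(f"unknown platform: {platform}")
    # Clamp to the 0-100 range to guard against rounding in the inputs
    return max(0.0, min(100.0, total))


def left_over_percent(idle, wait=0.0, wait_reported_separately=True):
    """Return the time left over after processing.

    Where wait I/O is reported separately (Solaris, Linux 2.6) it is
    added to %idle; on Linux 2.4, %idle alone is the figure.
    """
    return idle + wait if wait_reported_separately else idle


print(work_percent(12.5, 4.0, nice=1.5))              # Linux style sum
print(work_percent(12.5, 4.0, platform="solaris"))    # Solaris style sum
print(left_over_percent(70.0, wait=12.0))             # Solaris / Linux 2.6
```

Keeping the platform difference inside one function mirrors the aim of %work itself: the caller sees a single figure regardless of how the operating system splits processor time.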

The indicators load1, load5 and load15 report the 1, 5 and 15 minute load averages computed by the Unix kernel. This is a traditional measure of overall load that almost all Unix-like operating systems report; it is an exponentially decaying average of the number of runnable processes.
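To give a feel for how an exponentially decaying average behaves, the sketch below simulates the three load figures. It assumes the run queue is sampled every 5 seconds, which is the Linux kernel's convention; other kernels may sample differently, and the function names are invented for the example. Habitat simply records the values the kernel reports and does not compute them this way.

```python
import math


def update_load(load, runnable, period_s, interval_s=5.0):
    """Fold one run queue sample into a decaying load average."""
    decay = math.exp(-interval_s / period_s)
    return load * decay + runnable * (1.0 - decay)


def simulate(runnable_counts, period_s, interval_s=5.0):
    """Start from an idle system and apply each sample in turn."""
    load = 0.0
    for n in runnable_counts:
        load = update_load(load, n, period_s, interval_s)
    return load


# Two runnable processes for five minutes (60 samples, 5 seconds apart)
samples = [2] * 60
for name, period in (("load1", 60), ("load5", 300), ("load15", 900)):
    print(name, round(simulate(samples, period), 2))
```

The 1 minute figure climbs close to the true run queue length quickly, while the 15 minute figure lags well behind, which is why comparing the three shows whether load is rising or falling.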

Users of habitat should refer to operating system specific texts to better understand the meaning behind the indicators.

Storage

Storage statistics are collected by the io probe and appear as storage under the choice tree. Both capacity information (how full your disks are) and performance information are given by the same probe.

A full list of the data collected (at the time of writing) is held in an appendix; a few notable indicators are described below.

Capacity information is given by size, used and reserved, which fits the Unix model of reserved storage. %used is also calculated by the probe to show what proportion of the non-reserved space has been taken. 100% used means all user space on the device has been taken, leaving only the space reserved as an administration working area.
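One way to read the %used figure is sketched below. It assumes that used does not include the reserved space and that all three values share the same unit, in the manner of df; the exact formula used by the io probe is not given here, so treat this as an interpretation rather than the probe's implementation.

```python
def used_percent(size, used, reserved):
    """Return used space as a percentage of the non-reserved space."""
    user_space = size - reserved       # space available to ordinary users
    if user_space <= 0:
        raise ValueError("reserved must be smaller than size")
    # Cap at 100: beyond this point only the reserved area is left
    return min(100.0, 100.0 * used / user_space)


# A 1000 unit device with 50 units reserved for administration
print(used_percent(1000, 475, 50))   # half of the user space taken
print(used_percent(1000, 950, 50))   # all user space taken
```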

Performance information is given by read and write operations and storage transferred. For systems that support it, service time for read and write is also offered, which can be very helpful in working out service levels.

The storage ring holds multi-instance data; that is, each device provides the same set of measurements, where they are available. In ghabitat, this manifests as an additional scrolling list to select the devices to display (as described above).

Network

Network data is collected by the net probe and appears as network in the choice tree.

A full list of the data collected (at the time of writing) is held in an appendix; a few notable indicators are described below.

The data collected is split between receive and transmit statistics, known by their identifier prefixes rx_ and tx_. Typical indicators include packets (a measure of throughput), total bytes, errors (malformed data) and collisions (high for busy shared Ethernets).
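The prefix convention makes it simple to separate the two directions when processing exported network data. The sketch below assumes a single sample is available as a Python dictionary; the column names after the rx_ and tx_ prefixes are hypothetical, so check the appendix for the names the net probe really uses.

```python
def split_by_direction(sample):
    """Split one sample into receive and transmit dictionaries."""
    rx = {k[3:]: v for k, v in sample.items() if k.startswith("rx_")}
    tx = {k[3:]: v for k, v in sample.items() if k.startswith("tx_")}
    return rx, tx


def error_percent(stats):
    """Return errors as a percentage of packets (0 when no traffic)."""
    packets = stats.get("packets", 0)
    if packets == 0:
        return 0.0
    return 100.0 * stats.get("errors", 0) / packets


# Hypothetical sample for one interface
sample = {
    "rx_packets": 2000, "rx_errors": 10,
    "tx_packets": 1000, "tx_errors": 0, "tx_collisions": 25,
}
rx, tx = split_by_direction(sample)
print("receive error %:", error_percent(rx))
print("transmit error %:", error_percent(tx))
```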

A system usually has multiple interfaces, even if one of them is a loopback (typically lo0). Thus, when displaying network data in ghabitat, the multi interface mode operates and multiple interfaces may be selected for drawing.

Other Indicators

Other than the probes that collect the primary indicators, habitat also collects other data.

Four co-operating probes, named up, down, boot and alive, collect availability data which is displayed in the choice tree under the label uptime. Other than collecting some system specific information, the probes show when the system was last booted and create a history of down time that can be used in service levels.

Each operating system has a set of parameters which describe the operation of its kernel and its configuration. These are collected by a probe called name, which presents its data as symbols under the choice tree. The probe runs each time clockwork is started and collects the current configuration in the form of a simple key-value list.

Hardware interrupts are collected by a probe named intr and presented as interrupts under the choice tree. It shows interrupts of various sorts against real or synthetic devices. This measure can be quite system specific.

Adding to the standard data

New data is easily added to habitat; the details are covered in the Administration and Programming manuals.

Conceptually, once the chosen data is collected in the correct format (FHA generally), it needs to be appended to its own ring in a habitat data store (using habput or programmatically using the route interface). Once there, it can be displayed using ghabitat in tabular or graphical form under data in the choice tree.

Ringstores should be used unless harvest is installed, in which case the SQL ringstore (sqlrs:) also becomes an option. It is also possible to write to a ringstore and replicate it to sqlrs:. Replication is a separately configurable process and is the method habitat itself employs, as it is the most convenient.

Synthesising New Values

In addition to data that is recorded directly to a ring table following its measurement, a collecting probe may also synthesise its own values. This may be to abstract a measurement from system specific indicators or to make a more useful measurement, usually combining several native values.

This data is recorded at the same time as the native values in its own column, so that it shares the same sample time and sequence. However, it is advised that synthetic data is indicated as such in the info field describing the column.

In the system probe, %work is an example of synthesised data.

What is Abnormal?

This User Manual does not attempt to provide a guide to interpreting performance characteristics, which is a significant subject in its own right, with many texts dedicated to it.

However, certain things to watch out for include the following. In the system ring: prolonged use of processors (unless by design), high paging and any significant swapping. In the network ring: high error or collision rates. In the storage ring: long service times, or high usage resulting in little free space.
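As an illustration of what "prolonged" might mean in practice, the sketch below scans a series of %work samples for runs that stay above a threshold. The threshold of 90 and the minimum run of three samples are arbitrary values chosen for the example, not recommendations from habitat, and the function is not part of the product.

```python
def prolonged_runs(values, threshold, min_samples):
    """Return (start, end) index pairs of runs above the threshold.

    Only runs lasting at least min_samples consecutive samples count.
    """
    runs = []
    start = None
    for i, value in enumerate(values):
        if value > threshold:
            if start is None:
                start = i                      # a new run begins here
        else:
            if start is not None and i - start >= min_samples:
                runs.append((start, i - 1))    # the run ended at the previous sample
            start = None
    # A run may still be open when the data ends
    if start is not None and len(values) - start >= min_samples:
        runs.append((start, len(values) - 1))
    return runs


# Hypothetical %work samples taken at a regular interval
work = [20, 95, 97, 99, 96, 30, 98, 40, 92, 93, 94]
print(prolonged_runs(work, threshold=90, min_samples=3))
```

Requiring several consecutive samples ignores short spikes, such as the single high sample in the middle of the series, and reports only sustained load.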

Further Reading