Monitoring Ondat

Ingesting Ondat Metrics

Ondat metrics are exposed on each cluster node at http://ADVERTISE_IP:5705/metrics. For a full list of metrics that the endpoint provides please see Prometheus Endpoint. Metrics are exported in Prometheus text format, so collectors such as Prometheus, Telegraf or Sensu can be used. The examples on this page will reference Prometheus semantics.

For an example Prometheus and Grafana setup monitoring Ondat please see the example here.

Analysing Metrics

There are many metrics exposed by the Prometheus endpoint, but without a good understanding of what each metric is measuring, they may be difficult to interpret. To aid the visualisation of metrics a Grafana dashboard has been made available here.

Ondat Volume Metrics

Measuring IOPS

One of the most popular ways to measure the efficacy of a device is to measure the number of Input/Output Operations per Seconds (IOPS) the device can achieve. storageos_volume_frontend_write_total and storageos_volume_frontend_read_total can be used to calculate the IOPS rate using builtin Prometheus functions.

The metrics themselves are counters that report the total read/write operations for a volume from the application perspective. As a counter can only increase over time, the prometheus rate() function needs to be applied to get a measure of operations over time.

rate(storageos_volume_frontend_write_total[2m])

The Prometheus rate function calculates the per-second average rate of increase for a counter, over the 2 minute time period given. So, the function above gives the per-second average of writes over two minutes. Therefore, if the rate of both read and write totals is taken they can be summed to give IOPS.

Measuring Bandwidth

While IOPS is a measure of operations per second, bandwidth provides a measure of throughput, usually in MB/s. storageos_volume_frontend_write_bytes_total and storageos_volume_frontend_read_bytes_total are exposed as a way to calculate bandwidth from the application’s perspective.

These metrics are counters that report the total bytes read from/written to a volume. As with IOPS, a rate can be calculated to give the average number of bytes per second.

rate(storageos_volume_frontend_write_bytes_total[2m])

As with IOPS, the function above gives the per-second average increase in bytes written to a volume, therefore if the rate of read and write byte totals is summed you have the total volume bandwidth.

Frontend vs Backend Metrics

The Ondat Prometheus endpoint exposes both frontend and backend volume metrics. The frontend metrics relate to I/O operations against a Ondat volume’s filesystem. These operations are those executed by applications consuming Ondat volumes. Backend metrics relate to I/O operations that the Ondat container runs against devices that store the blob files. They are affected by Ondat features such as compression and encryption which the application is unaware of.

Ondat Node Metrics

The metrics endpoint exposes a standard set of metrics for every process that the Ondat container starts, including the metrics below.

Uptime

The Ondat control plane is the first process that starts when a Ondat pod is created. The storageos_control_process_start_time_seconds is a gauge that provides the start time of the control plane process since the Unix epoch.

time() - storageos_control_process_start_time_seconds{alias=~"$node"}

By subtracting the control plane start time from the current time since the Unix epoch, the total uptime of the process can be derived.

CPU Usage

The Ondat container will spawn a number of different processes. To calculate the total CPU footprint of the Ondat container, these processes need to be summed together. *_cpu_seconds metrics are counters that reflect the total seconds of CPU time each process has used.

(rate(storageos_control_process_cpu_seconds_total[3m]) +
rate(storastorageos_dataplane_process_cpu_seconds_total[3m]) +
rate(storastorageos_stats_process_cpu_seconds_total[3m])) * 100

To calculate the average number of seconds of CPU time used per second, a rate must be taken. The rate expresses the fraction of 1 second of CPU time that was used by the Ondat process in one second. Therefore to express this as a percentage, multiply by 100.

Memory Usage

*_resident_memory_bytes metrics are gauges that show the current resident memory of a Ondat process. Although metrics about virtual memory usage are also exposed, resident memory gives an overview of memory allocated to each process that is actively being used.

storageos_control_process_resident_memory_bytes
storageos_director_process_resident_memory_bytes
storageos_stats_process_resident_memory_bytes

As with CPU usage the resident memory of each Ondat process needs to be summed to calculate the memory footprint of Ondat processes.

Volumes per Node

Ondat has two volumes types; masters and replicas. A master volume is the device that a pod mounts and the replicas are hot stand-bys for the master volume.

sum(storageos_node_volumes_total{alias=~"$node"}) by (alias, volume_type)

By summing across the Prometheus alias and volume_type labels the number of master and replica volumes per node can be found. Changes in the relative numbers of master and replicas indicate that volumes have failed over, assuming that no new volumes or replicas have been created.