Monitoring Ondat
Ingesting Ondat Metrics
Ondat metrics are exposed on each cluster node at http://ADVERTISE_IP:5705/metrics. For a full list of metrics that the endpoint provides, please see Prometheus Endpoint.
Metrics are exported in Prometheus text format, so collectors such as Prometheus, Telegraf or Sensu can be used. The examples on this page reference Prometheus semantics.
For an example Prometheus and Grafana setup monitoring Ondat please see the example here.
Analysing Metrics
There are many metrics exposed by the Prometheus endpoint, but without a good understanding of what each metric is measuring, they may be difficult to interpret. To aid the visualisation of metrics a Grafana dashboard has been made available here.
Ondat Volume Metrics
Measuring IOPS
One of the most popular ways to measure the efficacy of a device is to measure the number of Input/Output Operations per Second (IOPS) the device can achieve. storageos_volume_frontend_write_total and storageos_volume_frontend_read_total can be used to calculate the IOPS rate using built-in Prometheus functions.
The metrics themselves are counters that report the total read/write operations for a volume from the application perspective. As a counter can only increase over time, the Prometheus rate() function needs to be applied to get a measure of operations over time.
rate(storageos_volume_frontend_write_total[2m])
The Prometheus rate function calculates the per-second average rate of increase for a counter, over the 2 minute time period given. So, the function above gives the per-second average of writes over two minutes. Therefore, if the rate of both read and write totals is taken they can be summed to give IOPS.
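As a sketch of the summation described above, the two rates can be added directly in PromQL. This assumes that both counters carry identical label sets, so that Prometheus can match each read series to its write series one-to-one; if the labels differ, an aggregation over a shared label would be needed instead.

```promql
# Per-volume IOPS: per-second reads plus per-second writes,
# each averaged over a 2 minute window.
rate(storageos_volume_frontend_read_total[2m])
  + rate(storageos_volume_frontend_write_total[2m])
```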
Measuring Bandwidth
While IOPS is a measure of operations per second, bandwidth provides a measure of throughput, usually in MB/s. storageos_volume_frontend_write_bytes_total and storageos_volume_frontend_read_bytes_total are exposed as a way to calculate bandwidth from the application’s perspective.
These metrics are counters that report the total bytes read from/written to a volume. As with IOPS, a rate can be calculated to give the average number of bytes per second.
rate(storageos_volume_frontend_write_bytes_total[2m])
As with IOPS, the function above gives the per-second average increase in bytes written to a volume. Summing the rates of the read and write byte totals therefore gives the total volume bandwidth.
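A possible query for total bandwidth in MB/s is sketched below. It assumes that the read and write byte counters share the same labels, and it takes 1 MB to mean 10^6 bytes; divide by 1048576 instead if MiB/s is wanted.

```promql
# Per-volume bandwidth in MB/s: bytes read plus bytes written per second,
# averaged over a 2 minute window, then converted from bytes to megabytes.
(
  rate(storageos_volume_frontend_read_bytes_total[2m])
    + rate(storageos_volume_frontend_write_bytes_total[2m])
) / 1e6
```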
Frontend vs Backend Metrics
The Ondat Prometheus endpoint exposes both frontend and backend volume metrics. The frontend metrics relate to I/O operations against an Ondat volume’s filesystem. These operations are those executed by applications consuming Ondat volumes. Backend metrics relate to I/O operations that the Ondat container runs against the devices that store the blob files. They are affected by Ondat features such as compression and encryption, which the application is unaware of.
Ondat Node Metrics
The metrics endpoint exposes a standard set of metrics for every process that the Ondat container starts, including the metrics below.
Uptime
The Ondat control plane is the first process that starts when an Ondat pod is created. The storageos_control_process_start_time_seconds metric is a gauge that provides the start time of the control plane process, in seconds since the Unix epoch.
time() - storageos_control_process_start_time_seconds{alias=~"$node"}
By subtracting the control plane start time from the current time since the Unix epoch, the total uptime of the process can be derived.
CPU Usage
The Ondat container will spawn a number of different processes. To calculate the total CPU footprint of the Ondat container, the CPU usage of these processes needs to be summed together. *_cpu_seconds_total metrics are counters that reflect the total seconds of CPU time each process has used.
(rate(storageos_control_process_cpu_seconds_total[3m]) +
rate(storageos_dataplane_process_cpu_seconds_total[3m]) +
rate(storageos_stats_process_cpu_seconds_total[3m])) * 100
To calculate the average number of seconds of CPU time used per second, a rate must be taken. The rate expresses the fraction of 1 second of CPU time that was used by the Ondat process in one second. Therefore to express this as a percentage, multiply by 100.
Memory Usage
*_resident_memory_bytes metrics are gauges that show the current resident memory of an Ondat process. Although metrics about virtual memory usage are also exposed, resident memory gives an overview of memory allocated to each process that is actively being used.
storageos_control_process_resident_memory_bytes
storageos_director_process_resident_memory_bytes
storageos_stats_process_resident_memory_bytes
As with CPU usage, the resident memory of each Ondat process needs to be summed to calculate the total memory footprint of Ondat processes.
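One way to express that sum in PromQL is sketched below. It assumes the three gauges carry matching labels on each node, so that the addition pairs up series from the same node; a label selector such as the alias filter used elsewhere on this page can be added to each metric to restrict the result.

```promql
# Resident memory footprint of the Ondat processes, in bytes.
storageos_control_process_resident_memory_bytes
  + storageos_director_process_resident_memory_bytes
  + storageos_stats_process_resident_memory_bytes
```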
Volumes per Node
Ondat has two volume types: masters and replicas. A master volume is the device that a pod mounts, and the replicas are hot standbys for the master volume.
sum(storageos_node_volumes_total{alias=~"$node"}) by (alias, volume_type)
By summing across the Prometheus alias and volume_type labels, the number of master and replica volumes per node can be found. Changes in the relative numbers of masters and replicas indicate that volumes have failed over, assuming that no new volumes or replicas have been created.