Prometheus

Our Prometheus endpoint exposes metrics about Ondat artefacts (such as volumes), as well as internal Ondat components.

Customers may scrape these metrics using Prometheus itself, or any compatible client, such as the popular Telegraf agent from InfluxData.
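As a rough illustration of what a compatible client does, the sketch below fetches the endpoint and parses the Prometheus text exposition format into tuples. The URL in `METRICS_URL` is an assumption and should be replaced with the address of your own node. The helper names `parse_metrics` and `scrape` are hypothetical, and the parser only handles the common `name{labels} value` line shape.

```python
import re
import urllib.request

# Assumed endpoint address; substitute the host and port of your own node.
METRICS_URL = "http://localhost:5705/metrics"

# Matches: metric name, optional {label="value",...} block, sample value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url=METRICS_URL, timeout=5.0):
    """Fetch the endpoint once and return the parsed samples."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `scrape()` against a reachable node returns a list that can be filtered by metric name, for example to pick out every `storageos_volume_capacity_bytes` sample.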

Artefact Metrics

Artefact metrics instrument a specific Ondat artefact, typically a volume. They are useful for general-purpose monitoring.

Volume metrics make use of the following Ondat logical concepts:

  • Frontend - the layer serving a volume to an application. Always on the same host as the application.
  • Network - the layer serving IO to remote volumes.
  • Backend - the layer representing IO to/from physical media. Not necessarily on the same host as an application.

| Name | Explanation | Additional Notes |
| --- | --- | --- |
| storageos_volume_backend_read_bytes_total | Backend read bytes | Bandwidth/volume |
| storageos_volume_backend_read_total | Backend read operations | Can be used to derive IOPS |
| storageos_volume_backend_write_bytes_total | Backend write bytes | Bandwidth/volume |
| storageos_volume_backend_write_total | Backend write operations | Can be used to derive IOPS |
| storageos_volume_capacity_bytes | Provisioned volume size, in bytes | |
| storageos_volume_frontend_read_bytes_total | Frontend read bytes | Bandwidth/volume |
| storageos_volume_frontend_read_error_total | Frontend read errors | |
| storageos_volume_frontend_read_total | Frontend read operations | Can be used to derive IOPS |
| storageos_volume_frontend_write_bytes_total | Frontend write bytes | Bandwidth/volume |
| storageos_volume_frontend_write_error_total | Frontend write errors | |
| storageos_volume_frontend_write_total | Frontend write operations | Can be used to derive IOPS |
| storageos_volume_network_read_bytes_total | Network read bytes | Bandwidth/volume |
| storageos_volume_network_read_error_total | Network read errors | |
| storageos_volume_network_read_retry_total | Network read retries | |
| storageos_volume_network_read_total | Network read operations | Can be used to derive IOPS |
| storageos_volume_network_read_wait_retry_total | Network read delayed retries | QoS related |
| storageos_volume_network_write_bytes_total | Network write bytes | Bandwidth/volume |
| storageos_volume_network_write_error_total | Network write errors | |
| storageos_volume_network_write_retry_total | Network write retries | |
| storageos_volume_network_write_total | Network write operations | Can be used to derive IOPS |
| storageos_volume_network_write_wait_retry_total | Network write delayed retries | QoS related |
| storageos_volume_utilisation_actual_bytes | Backend non-zero device blocks | A count of 1 MiB chunks that have been written to since volume creation. The count is taken before compression, encryption etc. |
| storageos_volume_utilisation_apparent_bytes | Backend storage size | Total blob file size on the host machine |
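
The operation and byte counters above are cumulative, so IOPS and bandwidth come from the difference between two scrapes divided by the time between them. The sketch below shows that arithmetic. It assumes the `_total` series behave as standard Prometheus counters that restart from zero when the process restarts. The function name `counter_rate` and the sample values are illustrative only.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Per-second rate between two samples of a cumulative counter.

    Times are in seconds (for example Unix timestamps of each scrape).
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, which we treat as a reset to zero,
        # so only the growth since the reset is counted.
        delta = curr_value
    return delta / elapsed


# Two scrapes of an operations counter taken 30 seconds apart give IOPS.
iops = counter_rate(1200, 0.0, 1800, 30.0)

# Two scrapes of a bytes counter taken 30 seconds apart give bytes/second.
bandwidth = counter_rate(0, 0.0, 31457280, 30.0)
```

When the metrics are stored in Prometheus, the built-in `rate()` function performs the equivalent calculation, including reset handling, over a time window.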

Node Metrics

Our node metrics instrument various aspects of the Ondat container's operation. They indicate the health of your cluster, and we may ask you to provide them during a support engagement.

| Name | Explanation | Additional Notes |
| --- | --- | --- |
| exposer_bytes_transferred | Bytes transferred to metrics services | |
| exposer_request_latencies | Latencies of serving scrape requests, in microseconds | |
| exposer_request_latencies_count | Total number of exposer_request_latencies observations | |
| exposer_request_latencies_sum | Sum of all exposer_request_latencies observations | |
| exposer_total_scrapes | Number of times metrics were scraped | |
| go_gc_duration_seconds | A summary of the GC invocation durations. | |
| go_gc_duration_seconds_count | Total number of go_gc_duration_seconds observations | |
| go_gc_duration_seconds_sum | Sum of all go_gc_duration_seconds observations | |
| go_goroutines | Number of goroutines that currently exist. | |
| go_info | Information about the Go environment. | |
| go_memstats_alloc_bytes | Number of bytes allocated and still in use. | |
| go_memstats_alloc_bytes_total | Total number of bytes allocated, even if freed. | |
| go_memstats_buck_hash_sys_bytes | Number of bytes used by the profiling bucket hash table. | |
| go_memstats_frees_total | Total number of frees. | |
| go_memstats_gc_cpu_fraction | The fraction of this program’s available CPU time used by the GC since the program started. | |
| go_memstats_gc_sys_bytes | Number of bytes used for garbage collection system metadata. | |
| go_memstats_heap_alloc_bytes | Number of heap bytes allocated and still in use. | |
| go_memstats_heap_idle_bytes | Number of heap bytes waiting to be used. | |
| go_memstats_heap_inuse_bytes | Number of heap bytes that are in use. | |
| go_memstats_heap_objects | Number of allocated objects. | |
| go_memstats_heap_released_bytes | Number of heap bytes released to OS. | |
| go_memstats_heap_sys_bytes | Number of heap bytes obtained from system. | |
| go_memstats_last_gc_time_seconds | Number of seconds since 1970 of last garbage collection. | |
| go_memstats_lookups_total | Total number of pointer lookups. | |
| go_memstats_mallocs_total | Total number of mallocs. | |
| go_memstats_mcache_inuse_bytes | Number of bytes in use by mcache structures. | |
| go_memstats_mcache_sys_bytes | Number of bytes used for mcache structures obtained from system. | |
| go_memstats_mspan_inuse_bytes | Number of bytes in use by mspan structures. | |
| go_memstats_mspan_sys_bytes | Number of bytes used for mspan structures obtained from system. | |
| go_memstats_next_gc_bytes | Number of heap bytes when next garbage collection will take place. | |
| go_memstats_other_sys_bytes | Number of bytes used for other system allocations. | |
| go_memstats_stack_inuse_bytes | Number of bytes in use by the stack allocator. | |
| go_memstats_stack_sys_bytes | Number of bytes obtained from system for stack allocator. | |
| go_memstats_sys_bytes | Number of bytes obtained from system. | |
| go_threads | Number of OS threads created. | |
| storageos_control_process_cpu_seconds_total | Total user and system CPU time spent in seconds (Ondat control process) | |
| storageos_control_process_major_faults_total | Total number of major page faults initiated by the process (Ondat control process) | |
| storageos_control_process_max_fds | Maximum number of open file descriptors (Ondat control process) | |
| storageos_control_process_open_fds | Number of open file descriptors (Ondat control process) | |
| storageos_control_process_resident_memory_bytes | Resident memory size in bytes (Ondat control process) | |
| storageos_control_process_start_time_seconds | Start time of the process since unix epoch in seconds (Ondat control process) | |
| storageos_control_process_sys_cpu_seconds_total | Total system CPU time spent in seconds (Ondat control process) | |
| storageos_control_process_threads_total | Number of currently spawned threads (Ondat control process) | |
| storageos_control_process_user_cpu_seconds_total | Total user CPU time spent in seconds (Ondat control process) | |
| storageos_control_process_virtual_memory_bytes | Virtual memory size in bytes (Ondat control process) | |
| storageos_control_process_virtual_memory_max_bytes | Maximum amount of virtual memory available in bytes (Ondat control process) | |
| storageos_dataplane_process_cpu_seconds_total | Total user and system CPU time spent in seconds (Ondat dataplane process) | |
| storageos_dataplane_process_major_faults_total | Total number of major page faults initiated by the process (Ondat dataplane process) | |
| storageos_dataplane_process_max_fds | Maximum number of open file descriptors (Ondat dataplane process) | |
| storageos_dataplane_process_open_fds | Number of open file descriptors (Ondat dataplane process) | |
| storageos_dataplane_process_resident_memory_bytes | Resident memory size in bytes (Ondat dataplane process) | |
| storageos_dataplane_process_start_time_seconds | Start time of the process since unix epoch in seconds (Ondat dataplane process) | |
| storageos_dataplane_process_sys_cpu_seconds_total | Total system CPU time spent in seconds (Ondat dataplane process) | |
| storageos_dataplane_process_threads_total | Number of currently spawned threads (Ondat dataplane process) | |
| storageos_dataplane_process_user_cpu_seconds_total | Total user CPU time spent in seconds (Ondat dataplane process) | |
| storageos_dataplane_process_virtual_memory_bytes | Virtual memory size in bytes (Ondat dataplane process) | |
| storageos_dataplane_process_virtual_memory_max_bytes | Maximum amount of virtual memory available in bytes (Ondat dataplane process) | |
| storageos_local_leader_detected_node_failed_total | Number of nodes the leader attempted to mark as failed | |
| storageos_local_leader_known_nodes_sync_seconds_bucket | Time taken to synchronise known node list from data store | |
| storageos_local_leader_known_nodes_sync_seconds_count | Total number of storageos_local_leader_known_nodes_sync_seconds observations | |
| storageos_local_leader_known_nodes_sync_seconds_sum | Sum of all storageos_local_leader_known_nodes_sync_seconds observations | |
| storageos_local_leader_known_nodes_total | Number of nodes being actively monitored by this leader | |
| storageos_local_leader_volume_master_recover_total | Number of master volumes the leader recovered | |
| storageos_node_device_capacity_bytes | Total device capacity usable by Ondat | |
| storageos_node_device_free_bytes | Available device capacity usable by Ondat | |
| storageos_node_dp_config_sync_seconds_bucket | Time taken to compare desired dataplane config state versus actual and apply differences | |
| storageos_node_dp_config_sync_seconds_count | Total number of storageos_node_dp_config_sync_seconds observations | |
| storageos_node_dp_config_sync_seconds_sum | Sum of all storageos_node_dp_config_sync_seconds observations | |
| storageos_node_volumes_total | Volumes on this node | |
| storageos_stats_process_cpu_seconds_total | Total user and system CPU time spent in seconds (Ondat stats process) | |
| storageos_stats_process_major_faults_total | Total number of major page faults initiated by the process (Ondat stats process) | |
| storageos_stats_process_max_fds | Maximum number of open file descriptors (Ondat stats process) | |
| storageos_stats_process_open_fds | Number of open file descriptors (Ondat stats process) | |
| storageos_stats_process_resident_memory_bytes | Resident memory size in bytes (Ondat stats process) | |
| storageos_stats_process_start_time_seconds | Start time of the process since unix epoch in seconds (Ondat stats process) | |
| storageos_stats_process_sys_cpu_seconds_total | Total system CPU time spent in seconds (Ondat stats process) | |
| storageos_stats_process_threads_total | Number of currently spawned threads (Ondat stats process) | |
| storageos_stats_process_user_cpu_seconds_total | Total user CPU time spent in seconds (Ondat stats process) | |
| storageos_stats_process_virtual_memory_bytes | Virtual memory size in bytes (Ondat stats process) | |
| storageos_stats_process_virtual_memory_max_bytes | Maximum amount of virtual memory available in bytes (Ondat stats process) | |
| storageos_store_query_seconds_bucket | Data store query latency by operation type | May reveal problems with external etcd |
| storageos_store_query_seconds_count | Total number of storageos_store_query_seconds observations | |
| storageos_store_query_seconds_sum | Sum of all storageos_store_query_seconds observations | |
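
Several of the timing metrics above appear as `_bucket`, `_count` and `_sum` series. Assuming these follow the standard Prometheus histogram convention (cumulative buckets keyed by an upper bound, ending in `+Inf`), a mean and an approximate quantile can be derived as sketched below. The function names and the bucket values in the example are illustrative, not taken from a real cluster.

```python
import math


def histogram_mean(sum_value, count_value):
    """Mean observation, from the _sum and _count series."""
    if count_value == 0:
        return math.nan
    return sum_value / count_value


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with math.inf as the final bound. Values are assumed to be
    spread evenly inside each bucket, so the result is an approximation.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the open-ended bucket, so report
                # the highest finite bound seen so far.
                return lower_bound
            in_bucket = cumulative - lower_count
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Illustrative cumulative buckets for a latency histogram, in seconds.
example_buckets = [(0.005, 10), (0.01, 60), (0.05, 90), (math.inf, 100)]
median = histogram_quantile(0.5, example_buckets)
mean = histogram_mean(1.5, 100)
```

For trend analysis it is usually more informative to apply the same calculation to the increase in each series between two scrapes rather than to the lifetime totals. Within Prometheus, `histogram_quantile()` performs a comparable estimate.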