Pod Placement
Ondat can influence Kubernetes Pod placement decisions to
ensure that Pods are scheduled on the same nodes as their data. This
functionality is known as Pod Locality.
Ondat grants access to data by presenting, on local or remote nodes, the devices used in a Pod’s VolumeMounts. However, it is often required or preferred to place the Pod on the node where the Ondat Primary Volume is located, because IO operations are fastest when network traffic and its associated latency are minimised: read operations are served locally, and writes require fewer round trips to the volume’s replicas.
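For context, a Pod consumes an Ondat volume through a regular PersistentVolumeClaim. The sketch below assumes a StorageClass named `storageos`; the PVC, Pod, and image names are illustrative, not prescribed values.

```yaml
# Minimal sketch: a PVC backed by an assumed "storageos" StorageClass,
# and a Pod mounting it via VolumeMounts. All names are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: storageos   # assumed StorageClass name
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: nginx:stable
      volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc
```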
Ondat automatically enables the use of a custom scheduler for any Pod using Ondat Volumes. Check out the Admission Controller reference for more information.
Locality modes
There are two modes available to set Pod locality for Ondat Volumes.
Preferred
The Pod SHOULD be placed alongside its data, if possible. Otherwise, it will be placed alongside volume replicas. If neither scenario is possible, the Pod will start on another node and Ondat will grant access to the data over the network.
Preferred mode is the default behaviour when using the Ondat scheduler.
Strict
The Pod MUST be placed alongside its data, i.e. on a node with the master volume or a replica. If that is not possible, the Pod remains in the Pending state until the requirement can be fulfilled.
The aim of strict mode is to let the user guarantee the best performance for applications. Some applications must deliver a certain level of performance, and for such applications strict co-location of application and data is essential.
For instance, when running Kafka Pods under heavy load, it may be better to avoid scheduling a Pod using a remote volume rather than have clients direct traffic at a cluster member which exhibits degraded performance.
To see examples on how to set a mode for your Pods, check out the examples reference page.
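As a rough illustration of what selecting a mode looks like, the sketch below marks a Pod for strict locality with a label. The `storageos.com/locality` key is a hypothetical stand-in, not a documented Ondat label; consult the examples reference page for the real key.

```yaml
# Illustration only: the label key below is hypothetical — the actual
# way to select a locality mode is documented on the examples page.
apiVersion: v1
kind: Pod
metadata:
  name: kafka-0
  labels:
    storageos.com/locality: strict   # hypothetical key
spec:
  containers:
    - name: kafka
      image: kafka:latest   # illustrative image
```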
Ondat Kubernetes Scheduler
Ondat achieves Pod locality by implementing a Kubernetes scheduler extender. The Kubernetes standard scheduler interacts with the Ondat scheduler when placement decisions need to be made.
The Kubernetes standard scheduler selects a set of nodes for a placement decision based on nodeSelectors, affinity rules, etc. This list of nodes is sent to the Ondat scheduler, which sends back the target node where the Pod will be placed.
The Ondat scheduler logic is provided by a Pod in the Namespace where Ondat Pods are running.
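To make the extender mechanism concrete, this is roughly how an HTTP scheduler extender is wired into kube-scheduler via its configuration API. The service address and verb paths below are assumptions for illustration; Ondat installs and configures its own scheduler, so these are not its actual deployment values.

```yaml
# Sketch of a KubeSchedulerConfiguration registering an HTTP extender.
# urlPrefix and the verb paths are hypothetical values.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "http://storageos-scheduler.storageos.svc:8080/scheduler"  # assumed address
    filterVerb: "filter"          # POST <urlPrefix>/filter
    prioritizeVerb: "prioritize"  # POST <urlPrefix>/prioritize
    weight: 1
    enableHTTPS: false
```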
Scheduling process
When a Pod needs to be scheduled, the scheduler collects information about all available nodes and the requirements of the Pod. The collected data is then passed through the Filter phase, during which the scheduler predicates are applied to the node data to decide if the given nodes are compatible with the Pod requirements. The result of the filter consists of a list of nodes that are compatible for the given Pod and a list of nodes that aren’t compatible.
The list of compatible nodes is then passed to the Prioritize phase, in which the nodes are scored based on attributes such as their state. The result of the Prioritize phase is a list of nodes with their respective scores. More favourable nodes get higher scores than less favourable nodes. The list is then used by the scheduler to decide the final node to schedule the Pod on.
Once a node has been selected, the third phase, Bind, handles the binding of the Pod to the Kubernetes apiserver. Once bound, the kubelet on the node provisions the Pod.
The Ondat scheduler implements the Filter and Prioritize phases and leaves binding to the default Kubernetes scheduler.
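The Filter/Prioritize exchange above can be sketched with the payload shapes defined by the Kubernetes scheduler extender API (`ExtenderArgs`, `ExtenderFilterResult`, and a host priority list). Payloads are shown in YAML for readability, though the wire format is JSON; the node names and failure message are invented.

```yaml
# Filter request: the Pod plus the candidate nodes chosen by the
# standard scheduler's predicates (ExtenderArgs).
pod: { metadata: { name: app } }
nodenames: ["node-1", "node-2", "node-3"]
---
# Filter response: surviving nodes and per-node failure reasons
# (ExtenderFilterResult). The message below is illustrative.
nodenames: ["node-1", "node-3"]
failedNodes:
  node-2: "node is not running Ondat"
---
# Prioritize response: a score per surviving node, used to pick
# the final placement target.
- host: node-1
  score: 15
- host: node-3
  score: 5
```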
Available +------------------+ +------------------+
NodeList & Pod | | Filtered NodeList | | Scored
Information | | & Pod Information | | NodeList
+-------------------->+ Filter +-------------------->+ Prioritize |--------------->
| (Predicates) | | (Priorities) |
| | | |
+------------------+ +------------------+
Scheduling Rules
The Ondat scheduler filters nodes, ensuring that the remaining subset fulfills the following prerequisites:
- The node is running Ondat
- The node is healthy
- The node is not Ondat Cordoned
- The node is not in an Ondat Drained state
- The node is not an Ondat compute-only node
The scoring protocol once the nodes are filtered is as follows:
- Node with master volume - 15 points
- Node with replica volume - 10 points
- Node with no master or replica volume - 5 points
- Node with unhealthy volume or unsynced replica - 1 point
Admission Controller
Ondat implements an admission controller that ensures any Pod using Ondat Volumes is scheduled by the Ondat scheduler. This makes the use of the scheduler transparent to the user. To learn how to alter this behaviour, see the reference page.
The Admission Controller is based on admission webhooks. Therefore, no custom
admission plugins need to be enabled at bootstrap of your Kubernetes cluster.
Admission webhooks are HTTP callbacks that receive admission requests and act
on them. The Ondat Cluster Operator serves the admission webhook, so when a
Pod is being created, the Ondat Cluster Operator mutates its
spec.schedulerName field, ensuring that storageos-scheduler is set.
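The effect of the mutation can be pictured on a Pod spec. The Pod, image, and claim names are illustrative; the schedulerName value is the one the webhook sets.

```yaml
# As submitted: schedulerName is omitted, so the default scheduler
# would normally place the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: nginx:stable
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc   # an Ondat-backed claim (illustrative name)
---
# After the Ondat admission webhook mutates the spec (only Pods using
# Ondat Volumes are mutated):
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  schedulerName: storageos-scheduler
  containers:
    - name: app
      image: nginx:stable
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc
```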