Troubleshooting
This section is aimed to help you troubleshoot issues in your cluster, whether they are related to the Ondat installation, integration with orchestrators or common misconfigurations.
Tools
To be able to troubleshoot issues the Ondat cli is required.
Pod in pending because of mount error
Issue:
The output of kubectl describe pod $POD_ID
contains no such file or directory
and references the Ondat volume device file.
root@node1:~# kubectl -n kube-system describe $POD_ID
(...)
Events:
(...)
Normal Scheduled 11s default-scheduler Successfully assigned default/d1 to node3
Warning FailedMount 4s (x4 over 9s) kubelet, node3 MountVolume.SetUp failed for volume "pvc-f2a49198-c00c-11e8-ba01-0800278dc04d" : stat /var/lib/storageos/volumes/d9df3549-26c0-4cfc-62b4-724b443069a1: no such file or directory
Reason:
There are two main reasons this issue may arise:
- The Ondat
DEVICE_DIR
location is wrongly configured when using Kubelet as a container - Mount Propagation is not enabled
(Option 1) Misconfiguration of the DeviceDir/SharedDir
Some Kubernetes distributions such as Rancher, DockerEE or some installations of OpenShift deploy the Kubelet as a container, because of this, the device files that Ondat creates to mount into the containers need to be visible to the kubelet. Ondat can be configured to share the device directory.
Modern installations use CSI, which handles the complexity internally.
Assert:
root@node1:~# kubectl -n default describe stos | grep "Shared Dir"
Shared Dir: # <-- Shouldn't be blank
Solution:
The Cluster Operator Custom Definition should specify the SharedDir option as follows.
spec:
sharedDir: '/var/lib/kubelet/plugins/kubernetes.io~storageos' # Needed when Kubelet as a container
...
See example on how to configure the Ondat Custom Resource.
(Option 2) Mount propagation is not enabled.
Applies only if Option 1 is configured properly.
Assert:
If not using the Kubelet as a container, SSH into one of the nodes and check if
/var/lib/storageos/volumes
is empty. If so, exec into any Ondat pod and
check the same directory.
root@node1:~# ls /var/lib/storageos/volumes/
root@node1:~# # <-- Shouldn't be blank
root@node1:~# kubectl exec $POD_ID -c storageos -- ls -l /var/lib/storageos/volumes
bst-196004
d529b340-0189-15c7-f8f3-33bfc4cf03fa
ff537c5b-e295-e518-a340-0b6308b69f74
If the directory inside the container and the device files are visible, disabled mount propagation is the cause.
If using the Kubelet as a container, SSH into one of the nodes and check if
/var/lib/kubelet/plugins/kubernetes.io~storageos/devices
is empty. If
so, exec into any Ondat pod and check the same directory.
root@node1:~# ls /var/lib/kubelet/plugins/kubernetes.io~storageos/devices
root@node1:~# # <-- Shouldn't be blank
root@node1:~# kubectl exec $POD_ID -c storageos -- ls -l /var/lib/kubelet/plugins/kubernetes.io~storageos/devices
bst-196004
d529b340-0189-15c7-f8f3-33bfc4cf03fa
ff537c5b-e295-e518-a340-0b6308b69f74
If the directory inside the container and the device files are visible, disabled mount propagation is the cause.
Solution:
Older versions of Kubernetes need to enable mount propagation as it is not enabled by default. Most Kubernetes distributions allow MountPropagation to be enabled using FeatureGates. Rancher specifically, needs to enable it in the “View in API” section of your cluster. You need to edit the section “rancherKubernetesEngineConfig” to enable the Kubelet feature gate.
PVC pending state - Failed to dial Ondat
A created PVC remains in pending state making pods that need to mount that PVC unable to start.
Issue:
root@node1:~/# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
vol-1 Pending storageos 7s
kubectl describe pvc $PVC
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 7s (x2 over 18s) persistentvolume-controller Failed to provision volume with StorageClass "storageos": Get http://storageos-cluster/version: failed to dial all known cluster members, (10.233.59.206:5705)
Reason:
For non CSI installations of Ondat, Kubernetes uses the Ondat API endpoint to communicate. If that communication fails, relevant actions such as create or mount volume can’t be transmitted to Ondat, hence the PVC will remain in pending state. Ondat never received the action to perform, so it never sent back an acknowledgement.
In this case, the Event message indicates that Ondat API is not responding, implying that Ondat is not running. For Kubernetes to define Ondat pods ready, the health check must pass.
Assert:
Check the status of Ondat pods.
root@node1:~/# kubectl -n kube-system get pod --selector app=storageos # for CSI add --selector kind=daemonset
NAME READY STATUS RESTARTS AGE
storageos-qrqkj 0/1 Running 0 1m
storageos-s4bfv 0/1 Running 0 1m
storageos-vcpfx 0/1 Running 0 1m
storageos-w98f5 0/1 Running 0 1m
If the pods are not READY, the service will not forward traffic to the API they serve hence PVC will remain in pending state until Ondat pods are available.
Kubernetes keeps trying to execute the action until it succeeds. If a PVC is created before Ondat finish starting, the PVC will be created eventually.
Solution:
- Ondat health check takes 60 seconds of grace before reporting as READY. If Ondat is starting properly after that period, the volume will be created when Ondat finishes its bootstrap.
- If Ondat is not running or is not starting properly, the solution would be to troubleshoot the installation.
PVC pending state - Secret Missing
A created PVC remains in pending state making pods that need to mount that PVC unable to start.
Issue:
kubectl describe pvc $PVC
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 13s (x2 over 28s) persistentvolume-controller Failed to provision volume with StorageClass "storageos": failed to get secret from ["storageos"/"storageos-api"]
Reason:
For non CSI installations of Ondat, Kubernetes uses the Ondat API endpoint to communicate. If that communication fails, relevant actions such as create or mount a volume can’t be transmitted to Ondat, and the PVC will remain in pending state. Ondat never received the action to perform, so it never sent back an acknowledgement.
The StorageClass provisioned for Ondat references a Secret from where it retrieves the API endpoint and the authentication parameters. If that secret is incorrect or missing, the connections won’t be established. It is common to see that the Secret has been deployed in a different namespace where the StorageClass expects it or that is has been deployed with a different name.
Assert:
-
Check the StorageClass parameters to know where the Secret is expected to be found.
$ kubectl get storageclass storageos -o yaml apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: storageos provisioner: csi.storageos.com allowVolumeExpansion: true parameters: csi.storage.k8s.io/fstype: ext4 storageos.com/replicas: "1" csi.storage.k8s.io/secret-name: storageos-api csi.storage.k8s.io/secret-namespace: storageos
Note that the parameters specify
secret-namespace
andsecret-name
. -
Check if the secret exists according to those parameters
kubectl -n storageos get secret storageos-api No resources found. Error from server (NotFound): secrets "storageos-api" not found
If no resources are found, it is clear that the Secret doesn’t exist or it is not deployed in the right location.
Solution:
Deploy Ondat following the installation procedures. If you are using the manifests provided for Kubernetes to deploy Ondat rather than using automated provisioners, make sure that the StorageClass parameters and the Secret reference match.
Peer discovery - Pod allocation
Issue:
Ondat nodes can’t join the cluster and show the following log entries.
time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address=172.28.128.5 category=etcd host=node3 module=cp target=172.28.128.6
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="172.28.128.3,172.28.128.4,172.28.128.5,172.28.128.6" error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp
Reason:
Ondat uses a gossip protocol to discover the nodes in the cluster. When
Ondat starts, one or more active nodes must be referenced so new nodes can
query existing nodes for the list of members. This error indicates that the node
can’t connect to any of the nodes in the known list. The known list is defined
in the JOIN
variable.
If there are no active Ondat nodes, the bootstrap process will elect the
first node in the JOIN
variable as master, and the rest will try to
discover from it. In case of that node not starting, the whole cluster will
remain unable to bootstrap.
Installations of Ondat use a DaemonSet, and by default do not schedule
Ondat pods to master nodes, due to the presence of the
node-role.kubernetes.io/master:NoSchedule
taint that is typically present. In
such cases the JOIN
variable must not contain master nodes or the Ondat
cluster will remain unable to start.
Assert:
Check that the first node of the JOIN
variable started properly.
root@node1:~/# kubectl -n kube-system describe ds/storageos | grep JOIN
JOIN: 172.28.128.3,172.28.128.4,172.28.128.5
root@node1:~/# kubectl -n kube-system get pod -o wide | grep 172.28.128.3
storageos-8zqxl 1/1 Running 0 2m 172.28.128.3 node1
Solution:
Make sure that the JOIN
variable doesn’t specify the master nodes. In case
you are using the discovery service, it is necessary to ensure that the
DaemonSet won’t allocate Pods on the masters. This can be achieved with taints,
node selectors or labels.
For installations with the Ondat operator you can specify which nodes to deploy Ondat on using nodeSelectors. See examples in the Cluster Operator Examples page.
For more advanced installations using compute-only and storage nodes, check the
storageos.com/deployment=computeonly
label that can be added to the nodes
through Kubernetes node labels, or Ondat in the Labels page.
LIO Init:Error
Issue:
Ondat pods not starting with Init:Error
kubectl -n kube-system get pod
NAME READY STATUS RESTARTS AGE
storageos-2kwqx 0/3 Init:Err 0 6s
storageos-cffcr 0/3 Init:Err 0 6s
storageos-d4f69 0/3 Init:Err 0 6s
storageos-nhq7m 0/3 Init:Err 0 6s
Reason:
This indicates that since the Linux open source SCSI drivers are not enabled, Ondat cannot start. The Ondat DaemonSet enables the required kernel modules on the host system. If you are seeing these errors it is because that container couldn’t load the modules.
Assert
Check the logs of the init container.
kubectl -n kube-system logs $ANY_STORAGEOS_POD -c storageos-init
In case of failure, it will show the following output, indicating which kernel modules couldn’t be loaded or that they are not properly configured:
Checking configfs
configfs mounted on sys/kernel/config
Module target_core_mod is not running
executing modprobe -b target_core_mod
Module tcm_loop is not running
executing modprobe -b tcm_loop
modprobe: FATAL: Module tcm_loop not found.
Solution:
Install the required kernel modules (usually found in the
linux-image-extra-$(uname -r)
package of your distribution) on your nodes
following this prerequisites page and delete Ondat
pods, allowing the DaemonSet to create the pods again.
LIO not enabled
Issue:
Ondat node can’t start and shows the following log entries.
time="2018-09-24T14:34:40Z" level=error msg="liocheck returned error" category=liocheck error="exit status 1" module=dataplane stderr="Sysfs root '/sys/kernel/config/target' is missing, is kernel configfs present and target_core_mod loaded? category=fslio level=warn\nRuntime error checking stage 'target_core_mod': SysFs root missing category=fslio level=warn\nliocheck: FAIL (lio_capable_system() returns failure) category=fslio level=fatal\n" stdout=
time="2018-09-24T14:34:40Z" level=error msg="failed to start dataplane services" error="system dependency check failed: exit status 1" module=command
Reason:
This indicates that one or more kernel modules required for Ondat are not loaded.
Assert
The following kernel modules must be enabled in the host.
lsmod | egrep "^tcm_loop|^target_core_mod|^target_core_file|^configfs"
Solution:
Install the required kernel modules (usually found in the
linux-image-extra-$(uname -r)
package of your distribution) on your nodes
following this prerequisites page and restart the container.
(OpenShift) Ondat pods missing – DaemonSet error
Ondat DaemonSet doesn’t have any pod replicas. The DaemonSet couldn’t allocate any Pod due to security issues.
Issue:
[root@master02 standard]# oc get pod
No resources found.
[root@master02 standard]# oc describe daemonset storageos
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 0s (x12 over 10s) daemonset-controller Error creating: pods "storageos-" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used provider restricted: .spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed capabilities.add: Invalid value: "SYS_ADMIN": capability may not be added spec.initContainers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.initContainers[0].securityContext.containers[0].hostPort: Invalid value: 5705: Host ports are not allowed to be used spec.initContainers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed capabilities.add: Invalid value: "SYS_ADMIN": capability may not be added spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.containers[0].hostPort: Invalid value: 5705: Host ports are not allowed to be used spec.containers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used]
Reason:
The OpenShift cluster has security context constraint policies enabled that forbid any pod, without an explicitly set policy for the service account, to be allocated.
Assert:
Check if the Ondat ServiceAccount can create pods with enough permissions
oc get scc privileged -o yaml # Or custom scc with enough privileges
(...)
users:
- system:admin
- system:serviceaccount:openshift-infra:build-controller
- system:serviceaccount:management-infra:management-admin
- system:serviceaccount:management-infra:inspector-admin
- system:serviceaccount:storageos:storageos <--
- system:serviceaccount:tiller:tiller
If the Ondat sa system:serviceaccount:storageos:storageos is in the privileged scc it will be able to create pods.
Solution:
Add the ServiceAccount system:serviceaccount:storageos:storageos to a scc with enough privileges.
oc adm policy add-scc-to-user privileged system:serviceaccount:storageos:storageos
Getting Help
If our troubleshooting guides do not help resolve your issue, please see our support section for details on how to get in touch with us.