Disk Full
When the physical disks in your cluster fill up, or start consuming disk space too
quickly, Kubernetes automatically applies the DiskPressure:NoExecute taint.
The Ondat DaemonSet has a toleration for this taint, so it continues to run.
Detecting a full disk
When viewing the Ondat container logs on the affected node with
kubectl logs storageos-daemonset-xxxx -c storageos, you will see messages like the following:
"msg": "StartAsyncFallocate: insufficient free space on file system free_space=1048305664 required_free_space=1073741824",
"msg": "Write: write failed volid=155051 error=BlobStorage::PrepWrite encountered a previous IO error preventing future IO for safety",
"msg": "Write: write failed volid=155051 error=all blob files are full - can not complete write",
"msg": "Write: write failed volid=155051 error=BlobStorage::PrepWrite encountered a previous IO error preventing future IO for safety",
"msg": "SCSI command failed type=write error=FATAL retries=0 time_to_deadline_secs=89",
When trying to attach a volume to a node with a full disk, you will see the
following when running kubectl describe pod $POD_USING_VOLUME:
AttachVolume.Attach failed for volume "pvc-xxx" : rpc error: code = Internal desc = internal error: rpc error: code = Internal desc = rpc error: code = Internal desc = fs: STATUS_FORBIDDEN: create failed in Notify handler error=Failed to create LUN for FsConfigVolume{volume_id=
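The StartAsyncFallocate line in the container logs above reports both the free space and the required free space in bytes, so the size of the shortfall can be computed directly. The following is a minimal sketch in POSIX shell. The function name fallocate_shortfall is hypothetical and not part of Ondat, and it assumes the message format matches the log excerpt shown earlier.

```shell
# Hypothetical helper (not part of Ondat): given one StartAsyncFallocate log
# line, print how many bytes short of the required free space the node is.
# Assumes the "free_space=N required_free_space=M" format from the excerpt.
fallocate_shortfall() {
  line=$1
  # The leading space keeps this from matching inside "required_free_space=".
  free=$(printf '%s\n' "$line" | sed -n 's/.* free_space=\([0-9][0-9]*\).*/\1/p')
  required=$(printf '%s\n' "$line" | sed -n 's/.*required_free_space=\([0-9][0-9]*\).*/\1/p')
  # Fail if either field is missing from the line.
  [ -n "$free" ] && [ -n "$required" ] || return 1
  echo $(( required - free ))
}

fallocate_shortfall '"msg": "StartAsyncFallocate: insufficient free space on file system free_space=1048305664 required_free_space=1073741824",'
# prints 25436160 (roughly 24 MiB short of the 1073741824-byte requirement)
```

In practice the input line would come from filtering the output of the kubectl logs command above for StartAsyncFallocate.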
Consequences of a full disk
Volume Provisioning
When provisioning a new volume, the dataplane checks that there is at least
1GB (1073741824 bytes, per the required_free_space value in the logs above) of space left on the node's underlying filesystem for the blob files
located at /var/lib/storageos/data/dev[0-9]+/vol.xxxxxx.0.blob and
/var/lib/storageos/data/dev[0-9]+/vol.xxxxxx.1.blob.
If there is insufficient space for both of the blob files that Ondat uses
to store data, the dataplane fails the volume create request.
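The same threshold can be checked by hand on a node before provisioning. The sketch below approximates the check described above and is not the dataplane's actual implementation. The function names avail_bytes and has_room_for_volume are hypothetical, and the threshold is taken from the required_free_space value in the log excerpt.

```shell
# Threshold taken from required_free_space in the log excerpt (1 GiB).
REQUIRED_FREE_BYTES=1073741824

# Print the bytes available on the filesystem that holds path $1.
# df -P prints one POSIX-format line per filesystem; -k reports 1024-byte
# blocks, and the fourth column is the available space.
avail_bytes() {
  kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  echo $(( kb * 1024 ))
}

# Succeed if $1 (available bytes) meets the threshold $2 (default 1 GiB).
has_room_for_volume() {
  [ "$1" -ge "${2:-$REQUIRED_FREE_BYTES}" ]
}

# Example, run on the node itself (path taken from the text above):
#   has_room_for_volume "$(avail_bytes /var/lib/storageos/data)" \
#     || echo "less than 1 GiB free: volume creation is expected to fail"
```

If the dev[0-9]+ directories sit on different devices, each one would need to be checked separately.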
Runtime Access
At runtime, if an attempted write to a blob file fails with an ENOSPC error,
the dataplane marks the file as full. Once both Ondat blob files in a
volume are marked as full, the dataplane marks the deployment with an error
flag, and all subsequent read/write operations return an I/O error.
This flag is stored only in memory, so clearing it requires restarting the Ondat DaemonSet pod on the affected node after the disk space issue has been remediated.
Replica Failover
A replica on a full disk can’t be failed over successfully.
Recovering from a full disk
To recover from a disk full error, you can either add storage space to the affected node or delete existing persistent volume claims.
N.B. However you choose to resolve the issue, the Ondat DaemonSet pod must be restarted.
This is because the blob files disallow operations at runtime via the previously discussed error flag, which is stored in memory. The flag does not survive a restart, so once the pod has restarted, volumes can operate normally again.
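One way to restart the pod is to delete it and let the DaemonSet controller recreate it. The following is a sketch under stated assumptions: the wrapper functions pods_on_node and restart_ondat_pod are hypothetical, and the namespace and node name in the example are placeholders that depend on your installation.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the
# arguments without contacting a cluster.
KUBECTL=${KUBECTL:-kubectl}

# List the pods scheduled on node $1 in namespace $2, to find the name of
# the Ondat DaemonSet pod running on the affected node.
pods_on_node() {
  $KUBECTL get pods --namespace "$2" --field-selector "spec.nodeName=$1" -o wide
}

# Delete pod $1 in namespace $2. The DaemonSet controller then creates a
# replacement pod, which starts without the in-memory error flag.
restart_ondat_pod() {
  $KUBECTL delete pod "$1" --namespace "$2"
}

# Example (the "storageos" namespace and "worker-1" node are assumptions):
#   pods_on_node worker-1 storageos
#   restart_ondat_pod storageos-daemonset-xxxx storageos
```

Restart the pod only after the disk space issue has been fixed. Otherwise writes will fail again with the same errors.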
Adding New Storage Space
If you choose to recover by expanding your capacity, you have two main options:
- Add new storage devices at /var/lib/storageos/data/dev[0-9]+ as described here.
- Expand the underlying filesystem that Ondat is using as described here.
After expanding capacity, remember to restart the Ondat DaemonSet pod.
Deleting existing PVCs
If you choose to recover by deleting existing persistent volume claims, run:
kubectl delete pvc <PVCOnFullNodeName>
Then restart the Ondat DaemonSet pod.