Rants on NFS's Lack of File Handle Visibility to Sysadmins
NFS is a long-established way to share a filesystem across Linux nodes. It has some capabilities that are currently indispensable for Linux clusters: it can lock files across nodes and allow either exclusive or non-exclusive access to the same file.
Fault Tolerance / Recovery
I have read some papers on NFS, and it is supposed to recover when a host or server restarts. Unfortunately, on several occasions we found this to be not quite true: after a host serving NFS was restarted, we got stale file handle errors on the clients. The workaround is to restart the NFS client and, if that still doesn't fix the situation, restart the NFS server. In our case we sometimes need to restart twice across the cluster (because a client hangs while running a program over NFS). Some might say that programs shouldn't be run over NFS (only data files should be), but we have deployed a SAP-documented cluster architecture that requires this use of NFS.
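To a program, a stale handle shows up as an OSError with errno ESTALE. Here is a minimal Python sketch of a check that could be run against a mount point before deciding whether a client restart is needed. The function name is_stale and the injectable stat parameter are my own illustration, not part of any NFS tooling, and the check assumes the mount answers at all: on a hung hard mount, os.stat can block indefinitely.

```python
import errno
import os


def is_stale(path, stat=os.stat):
    """Return True if stat() on path fails with ESTALE (stale file handle).

    `stat` is injectable only so the check can be exercised without a
    real broken NFS mount; by default it is os.stat.
    """
    try:
        stat(path)
    except OSError as exc:
        # Any other error (ENOENT, EACCES, ...) is not a stale handle.
        return exc.errno == errno.ESTALE
    return False
```

Calling is_stale on an NFS mount point (the path is whatever your cluster uses) returns True only for the stale handle case, so missing paths or permission problems are not mistaken for it.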
Locks
When a file is locked over NFS, one lock is created on the server host and a second one on the client. The problem occurs when the lock is exclusive and we don't know which container on which node is holding it. In that case we have a specific file and are wondering which process holds the lock. In our OpenShift cluster we have 9 worker nodes, each with a capacity of 160 containers, and every one of those containers could be holding such a lock. In the other scenario, we want to find out which files are locked on the server hosting the NFS share. Why isn't there a kernel API that shows the paths of NFS locks on a ZFS filesystem? lslocks can't tell us anything about the file path, presumably because of the combination of the filesystem (ZFS) and the locking process actually being on a remote host. In a previous post I showed zdb -dddd to check the file path, but alas, in the 7.5.x ZFS this doesn't show any path.
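One partial workaround is to go below lslocks and read /proc/locks directly, which lists each lock with its PID and a major:minor:inode triple instead of a path. The Python sketch below parses that text and filters by inode. The function names and the sample lines are hypothetical, made up for illustration. It also rests on assumptions: the PID is only meaningful on the node where the locking process runs (on the NFS server it will not identify the remote container), and I have not verified how well the device numbers line up on ZFS, so the filter matches on inode alone.

```python
import os


def parse_proc_locks(text):
    """Parse the contents of /proc/locks into a list of dicts."""
    locks = []
    for line in text.splitlines():
        fields = line.split()
        # Waiters are marked with "->" right after the lock id.
        blocked = len(fields) > 1 and fields[1] == "->"
        if blocked:
            del fields[1]
        if len(fields) < 8:
            continue  # not a lock line we understand
        major, minor, inode = fields[5].split(":")
        locks.append({
            "kind": fields[1],      # POSIX, FLOCK or OFDLCK
            "access": fields[3],    # READ or WRITE
            "pid": int(fields[4]),
            "dev": (int(major, 16), int(minor, 16)),  # hex in /proc/locks
            "inode": int(inode),
            "blocked": blocked,
        })
    return locks


def locks_on_inode(locks, inode):
    """Return the lock entries that refer to the given inode number."""
    return [lock for lock in locks if lock["inode"] == inode]


def locks_on_path(path, proc_locks="/proc/locks"):
    """Look up locks on `path` by inode, using the local /proc/locks."""
    with open(proc_locks) as handle:
        locks = parse_proc_locks(handle.read())
    return locks_on_inode(locks, os.stat(path).st_ino)


# Hypothetical sample in the /proc/locks layout, for illustration only.
SAMPLE = """\
1: POSIX  ADVISORY  WRITE 4242 00:2d:131 0 EOF
1: -> POSIX  ADVISORY  WRITE 4343 00:2d:131 0 EOF
2: FLOCK  ADVISORY  WRITE 777 08:01:9001 0 EOF
"""
```

Running locks_on_path on each worker node against the suspect file narrows things down to a PID on that node, which can then be mapped to a container by the usual means. Going the other way on the server, the inode numbers from /proc/locks could be fed to find -inum on the export, which is slow but does not depend on zdb.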