Description
When a NAS backup repository becomes temporarily unreachable, running VMs on the KVM host can be affected — including VMs that have no relationship to the backup operation. This is because the NFS mount used by nasbackup.sh defaults to hard mode, which blocks indefinitely when the NFS server is unresponsive.
Root Cause
The backup_repository.mount_opts column defaults to empty. When nasbackup.sh calls mount_operation(), it mounts the backup NFS share with no special options, which defaults to NFS hard mode:
```bash
# nasbackup.sh mount_operation()
mount -t ${NAS_TYPE} ${NAS_ADDRESS} ${mount_point} $([[ ! -z "${MOUNT_OPTS}" ]] && echo -o ${MOUNT_OPTS})
```

With hard mode, any I/O operation on the NFS mount blocks indefinitely when the server is unreachable. This causes a cascade:
- `nasbackup.sh` hangs on NFS I/O (write, sync, or umount)
- The CloudStack agent is blocked because `nasbackup.sh` runs as a child process of the agent JVM
- The blocked agent cannot process any VM operations (PlugNic, Stop, Migrate, etc.) — all commands queue behind the stuck backup
- At the host kernel level, NFS `hard` mount stalls can cause I/O waits that affect all processes, including QEMU instances for unrelated VMs
- VMs experience I/O timeouts — Windows guests BSOD with `KERNEL_DATA_INPAGE_ERROR`, Linux guests may freeze
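The option handling from `mount_operation()` can be exercised in isolation to see why an empty `mount_opts` falls back to the kernel's `hard` default: the conditional emits no `-o` flag at all. A minimal sketch — the address, export path, and mount point are illustrative, and the mount command is only echoed, never executed:

```shell
#!/usr/bin/env bash
# Sketch of the option handling in nasbackup.sh mount_operation().
# NAS_ADDRESS and mount_point are illustrative values, not real hosts.
NAS_TYPE=nfs
NAS_ADDRESS=172.16.3.63:/ACS
mount_point=/tmp/nasbackup

build_mount_cmd() {
    local MOUNT_OPTS="$1"
    # Same construct as nasbackup.sh: only append -o when MOUNT_OPTS is non-empty.
    echo "mount -t ${NAS_TYPE} ${NAS_ADDRESS} ${mount_point}" \
         $([[ ! -z "${MOUNT_OPTS}" ]] && echo -o "${MOUNT_OPTS}")
}

build_mount_cmd ""                          # no -o at all → NFS defaults to hard
build_mount_cmd "soft,timeo=50,retrans=3"   # explicit soft mount
```

With the empty string, the generated command carries no options, so the resulting mount is `hard` by default.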
Evidence (from production CloudStack 4.20)
NFS server `172.16.3.63` experienced intermittent connectivity issues. Host `dmesg` showed repeated cycles:

```
nfs: server 172.16.3.63 not responding, still trying
nfs: server 172.16.3.63 OK
nfs: server 172.16.3.63 not responding, still trying
nfs: server 172.16.3.63 OK
```
Impact:
- A Windows VM (`citytravelsacco`, `i-2-1651-VM`) on the same host crashed with a BSOD (`KERNEL_DATA_INPAGE_ERROR`) even though its disk is on local storage, not NFS
- The CloudStack agent was blocked for 3+ hours by a stuck `nasbackup.sh` process, preventing all VM management operations on the host
- A NIC hot-plug operation queued for 30+ minutes waiting for the agent to become responsive
Suggested Fix
- Default `mount_opts` for NAS backup repositories to `soft,timeo=50,retrans=3` — this causes NFS operations to fail after roughly 15 seconds instead of blocking forever. A failed backup is far preferable to crashing production VMs.
- Add a timeout wrapper to `nasbackup.sh` — if the entire backup operation exceeds a configurable duration, kill it cleanly (resume the paused VM, unmount, exit with an error).
- Document the risk — warn administrators that an empty `mount_opts` on NAS backup repositories defaults to NFS `hard` mode, which can cause host-wide I/O stalls.

Note: a soft mount may cause backup data corruption if the NFS server recovers mid-write, but this only affects the backup copy, not the production VM. A corrupted backup can be retried; a crashed production VM cannot be un-crashed.
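The timeout-wrapper suggestion can be sketched with coreutils `timeout(1)`. This is illustrative only, not the actual nasbackup.sh code: `BACKUP_TIMEOUT`, `do_backup`, and the commented cleanup steps are assumptions.

```shell
#!/usr/bin/env bash
# Illustrative sketch of a deadline wrapper for the backup step.
# BACKUP_TIMEOUT, do_backup, VM_NAME and MOUNT_POINT are assumed names.
BACKUP_TIMEOUT="${BACKUP_TIMEOUT:-3600}"    # seconds; should be configurable

do_backup() {
    # Stand-in for the real copy to the NFS mount; on a hard mount this
    # is the call that would hang forever when the server goes away.
    sleep 1
}
export -f do_backup

if timeout --kill-after=30 "${BACKUP_TIMEOUT}" bash -c do_backup; then
    echo "backup finished"
else
    rc=$?                                   # 124 means the deadline was hit
    echo "backup failed or timed out (rc=${rc}); cleaning up" >&2
    # virsh resume "${VM_NAME}"             # resume the paused VM
    # umount -l "${MOUNT_POINT}"            # lazy-unmount the stuck share
    exit "${rc}"
fi
```

`--kill-after=30` escalates to SIGKILL if the child ignores SIGTERM, which matters here because a process stuck in uninterruptible NFS I/O may not react to the first signal.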
Workaround
Manually set mount options on the backup repository:
```sql
UPDATE cloud.backup_repository SET mount_opts='soft,timeo=50,retrans=3' WHERE id=<repo_id>;
```

And update `/etc/fstab` on KVM hosts if the NFS backup share is persistently mounted:

```
172.16.3.63:/ACS /tmp/nasbackup nfs soft,timeo=50,retrans=3,_netdev 0 0
```
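After applying the workaround, it is worth confirming the share really is mounted soft. A small check along these lines — the sample option strings are illustrative; on a real host the string would come from `findmnt -no OPTIONS <mountpoint>`:

```shell
#!/usr/bin/env bash
# Sketch: classify an NFS mount's option string as soft or hard.
# Sample strings below are illustrative, not taken from a real host.
classify_mount() {
    local opts="$1"            # e.g. output of: findmnt -no OPTIONS /tmp/nasbackup
    case ",${opts}," in
        *,soft,*) echo "soft" ;;
        *)        echo "hard" ;;   # hard is the NFS default when unspecified
    esac
}

classify_mount "rw,vers=4.1,soft,timeo=50,retrans=3"   # → soft
classify_mount "rw,vers=4.1,timeo=600,retrans=2"       # → hard
```

Any backup mount classified `hard` here is still exposed to the indefinite-blocking behavior described above.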
Versions
- CloudStack 4.20
- NFS v4.1
- KVM hosts: Debian/Ubuntu with kernel 5.x/6.x