
KVM NAS backup: hard NFS mount default causes VM outages when backup storage is unreachable #12829

@jmsperu

Description

When a NAS backup repository becomes temporarily unreachable, running VMs on the KVM host can stall or crash — including VMs that have no relationship to the backup operation. This is because the NFS mount used by nasbackup.sh defaults to hard mode, which blocks indefinitely while the NFS server is unresponsive.

Root Cause

The backup_repository.mount_opts column defaults to empty. When nasbackup.sh calls mount_operation(), it mounts the backup NFS share with no special options, which defaults to NFS hard mode:

# nasbackup.sh mount_operation()
mount -t ${NAS_TYPE} ${NAS_ADDRESS} ${mount_point} $([[ ! -z "${MOUNT_OPTS}" ]] && echo -o ${MOUNT_OPTS})
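A minimal hardening sketch (not the current script): fall back to a safe default inside mount_operation() when the repository row left mount_opts empty, so the share is never implicitly hard-mounted. Variable names mirror the snippet above; the example address and mount point are taken from the evidence below, and the default value matches the fix suggested later in this issue. The mount command is echoed as a dry run — drop the echo to actually mount (requires root).

```shell
#!/usr/bin/env bash
# Sketch only: apply a safe soft-mount default when mount_opts was left empty.
# NAS_TYPE/NAS_ADDRESS/mount_point mirror the nasbackup.sh snippet above;
# the fallback value is the one proposed in this issue, not current behavior.
NAS_TYPE="nfs"
NAS_ADDRESS="172.16.3.63:/ACS"
mount_point="/tmp/nasbackup"

# ${VAR:-default} keeps an operator-supplied value but never leaves it empty.
MOUNT_OPTS="${MOUNT_OPTS:-soft,timeo=50,retrans=3}"

# Dry run; remove `echo` to perform the real mount.
echo mount -t "${NAS_TYPE}" "${NAS_ADDRESS}" "${mount_point}" -o "${MOUNT_OPTS}"
```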

With hard mode, any I/O operation on the NFS mount blocks indefinitely when the server is unreachable. This causes a cascade:

  1. nasbackup.sh hangs on NFS I/O (write, sync, or umount)
  2. The CloudStack agent is blocked because nasbackup.sh runs as a child process of the agent JVM
  3. The blocked agent cannot process any VM operations (PlugNic, Stop, Migrate, etc.) — all commands queue behind the stuck backup
  4. At the kernel level, NFS hard-mount stalls can cause I/O waits that affect all processes on the host, including QEMU instances for unrelated VMs
  5. VMs experience I/O timeouts — Windows guests BSOD with KERNEL_DATA_INPAGE_ERROR, Linux guests may freeze
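The hang in step 1 can at least be detected before it propagates. A sketch (the mount point path is illustrative, and nfs_responsive is a hypothetical helper, not part of nasbackup.sh) that probes the mount with a time-bounded stat instead of touching it directly:

```shell
#!/usr/bin/env bash
# Sketch: probe an NFS mount point without risking an indefinite hang.
# On a hard mount, a plain stat(2) can block forever; wrapping it in
# coreutils `timeout` turns the hang into a detectable failure.
# Caveat: a probe already stuck in uninterruptible sleep (D state) may
# survive even SIGKILL until the NFS server answers, so probe *before*
# issuing real I/O, not after.

nfs_responsive() {
    # SIGTERM after 5s, SIGKILL 2s later if the stat is still alive.
    timeout -k 2 5 stat -t "$1" >/dev/null 2>&1
}

if nfs_responsive /tmp/nasbackup; then
    echo "mount responsive"
else
    echo "mount unresponsive or missing" >&2
fi
```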

Evidence (from production CloudStack 4.20)

NFS server 172.16.3.63 experienced intermittent connectivity issues. Host dmesg showed repeated cycles:

nfs: server 172.16.3.63 not responding, still trying
nfs: server 172.16.3.63 OK
nfs: server 172.16.3.63 not responding, still trying
nfs: server 172.16.3.63 OK

Impact:

  • A Windows VM (citytravelsacco, i-2-1651-VM) on the same host crashed with BSOD (KERNEL_DATA_INPAGE_ERROR) even though its disk is on local storage, not NFS
  • The CloudStack agent was blocked for 3+ hours by a stuck nasbackup.sh process, preventing all VM management operations on the host
  • A NIC hot-plug operation queued for 30+ minutes waiting for the agent to become responsive

Suggested Fix

  1. Default mount_opts for NAS backup repositories to soft,timeo=50,retrans=3 (timeo is in tenths of a second, so each attempt times out after 5 seconds) — this causes NFS operations to fail after roughly 15 seconds instead of blocking forever. A failed backup is far preferable to crashing production VMs.

  2. Add a timeout wrapper to nasbackup.sh — if the entire backup operation exceeds a configurable duration, kill it cleanly (resume paused VM, unmount, exit with error).

  3. Document the risk — warn administrators that empty mount_opts on NAS backup repositories defaults to NFS hard mode, which can cause host-wide I/O stalls.

Note: soft mount may cause backup data corruption if the NFS server recovers mid-write, but this only affects the backup copy — not the production VM. A corrupted backup can be retried; a crashed production VM cannot be un-crashed.
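Fix 2 (the timeout wrapper) might look like the sketch below. BACKUP_TIMEOUT, MOUNT_POINT, do_backup, and the cleanup steps are all illustrative assumptions, not existing nasbackup.sh code:

```shell
#!/usr/bin/env bash
# Sketch of a configurable time budget around the backup step.
# Every name here (BACKUP_TIMEOUT, MOUNT_POINT, do_backup) is an
# assumption for illustration only.
set -u

BACKUP_TIMEOUT="${BACKUP_TIMEOUT:-3600}"     # seconds; should be configurable
MOUNT_POINT="${MOUNT_POINT:-/tmp/nasbackup}"

cleanup() {
    # Resume a VM paused for the backup, if any (command illustrative):
    #   virsh resume "${VM_NAME}" || true
    # Lazy unmount so cleanup itself cannot hang on a dead NFS server:
    umount -l "${MOUNT_POINT}" 2>/dev/null || true
}

do_backup() {
    sleep 1   # stand-in for the real copy/sync work
}
export -f do_backup

# `timeout` sends SIGTERM at the deadline and SIGKILL 10 seconds later.
if timeout -k 10 "${BACKUP_TIMEOUT}" bash -c do_backup; then
    echo "backup finished"
else
    rc=$?                 # 124 means timed out; anything else is a backup error
    cleanup
    echo "backup failed or timed out (rc=${rc})" >&2
    exit "${rc}"
fi
```

Exit code 124 is how GNU timeout reports that the deadline fired, which lets the caller distinguish "backup hung on NFS" from an ordinary copy failure.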

Workaround

Manually set mount options on the backup repository:

UPDATE cloud.backup_repository SET mount_opts='soft,timeo=50,retrans=3' WHERE id=<repo_id>;

And update /etc/fstab on KVM hosts if the NFS backup share is persistently mounted:

172.16.3.63:/ACS /tmp/nasbackup nfs soft,timeo=50,retrans=3,_netdev 0 0
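After remounting, it is worth confirming that the kernel actually applied the new options — a quick check (paths match the example above; findmnt is part of util-linux):

```shell
# Show the options the kernel is actually using for the mount:
findmnt -o TARGET,OPTIONS /tmp/nasbackup

# Or read /proc/mounts directly and check for "soft":
grep ' /tmp/nasbackup ' /proc/mounts | grep -q 'soft' \
    && echo "soft mount confirmed" \
    || echo "WARNING: still hard-mounted (or not mounted)" >&2
```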

Versions

  • CloudStack 4.20
  • NFS v4.1
  • KVM hosts: Debian/Ubuntu with kernel 5.x/6.x
