Description
When a NAS backup repository becomes temporarily unreachable, running VMs on the KVM host can be affected — including VMs that have no relationship to the backup operation. This is because the NFS mount used by nasbackup.sh defaults to hard mode, which blocks indefinitely when the NFS server is unresponsive.
Root Cause
The backup_repository.mount_opts column defaults to empty. When nasbackup.sh calls mount_operation(), it mounts the backup NFS share with no special options, which defaults to NFS hard mode:
```bash
# nasbackup.sh mount_operation()
mount -t ${NAS_TYPE} ${NAS_ADDRESS} ${mount_point} $([[ ! -z "${MOUNT_OPTS}" ]] && echo -o ${MOUNT_OPTS})
```

With hard mode, any I/O operation on the NFS mount blocks indefinitely when the server is unreachable. This causes a cascade:
- `nasbackup.sh` hangs on NFS I/O (write, sync, or umount)
- The CloudStack agent is blocked because `nasbackup.sh` runs as a child process of the agent JVM
- The blocked agent cannot process any VM operations (PlugNic, Stop, Migrate, etc.) — all commands queue behind the stuck backup
- At the host kernel level, NFS `hard` mount stalls can cause I/O waits that affect all processes, including QEMU instances for unrelated VMs
- VMs experience I/O timeouts — Windows guests BSOD with `KERNEL_DATA_INPAGE_ERROR`, Linux guests may freeze
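The option handling from `mount_operation()` can be exercised in isolation to see why an empty `mount_opts` falls back to the kernel's `hard` default: the conditional emits no `-o` flag at all. A minimal sketch — the address, export path, and mount point are illustrative, and the mount command is only echoed, never executed:

```shell
#!/usr/bin/env bash
# Sketch of the option handling in nasbackup.sh mount_operation().
# NAS_ADDRESS and mount_point are illustrative values, not real hosts.
NAS_TYPE=nfs
NAS_ADDRESS=172.16.3.63:/ACS
mount_point=/tmp/nasbackup

build_mount_cmd() {
    local MOUNT_OPTS="$1"
    # Same construct as nasbackup.sh: only append -o when MOUNT_OPTS is non-empty.
    echo "mount -t ${NAS_TYPE} ${NAS_ADDRESS} ${mount_point}" \
         $([[ ! -z "${MOUNT_OPTS}" ]] && echo -o "${MOUNT_OPTS}")
}

build_mount_cmd ""                          # no -o at all → NFS defaults to hard
build_mount_cmd "soft,timeo=50,retrans=3"   # explicit soft mount
```

With the empty string, the generated command carries no options, so the resulting mount is `hard` by default.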
Evidence (from production CloudStack 4.20)
NFS server `172.16.3.63` experienced intermittent connectivity issues. Host `dmesg` showed repeated cycles:

```
nfs: server 172.16.3.63 not responding, still trying
nfs: server 172.16.3.63 OK
nfs: server 172.16.3.63 not responding, still trying
nfs: server 172.16.3.63 OK
```
Impact:
- A Windows VM (`citytravelsacco`, `i-2-1651-VM`) on the same host crashed with a BSOD (`KERNEL_DATA_INPAGE_ERROR`) even though its disk is on local storage, not NFS
- The CloudStack agent was blocked for 3+ hours by a stuck `nasbackup.sh` process, preventing all VM management operations on the host
- A NIC hot-plug operation queued for 30+ minutes waiting for the agent to become responsive
Suggested Fix
- Default `mount_opts` for NAS backup repositories to `soft,timeo=50,retrans=3` — this causes NFS operations to fail after roughly 15 seconds instead of blocking forever. A failed backup is far preferable to crashing production VMs.
- Add a timeout wrapper to `nasbackup.sh` — if the entire backup operation exceeds a configurable duration, kill it cleanly (resume the paused VM, unmount, exit with an error).
- Document the risk — warn administrators that an empty `mount_opts` on NAS backup repositories defaults to NFS `hard` mode, which can cause host-wide I/O stalls.

Note: a soft mount may cause backup data corruption if the NFS server recovers mid-write, but this only affects the backup copy, not the production VM. A corrupted backup can be retried; a crashed production VM cannot be un-crashed.
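The timeout-wrapper suggestion can be sketched with coreutils `timeout(1)`. This is illustrative only, not the actual nasbackup.sh code: `BACKUP_TIMEOUT`, `do_backup`, and the commented cleanup steps are assumptions.

```shell
#!/usr/bin/env bash
# Illustrative sketch of a deadline wrapper for the backup step.
# BACKUP_TIMEOUT, do_backup, VM_NAME and MOUNT_POINT are assumed names.
BACKUP_TIMEOUT="${BACKUP_TIMEOUT:-3600}"    # seconds; should be configurable

do_backup() {
    # Stand-in for the real copy to the NFS mount; on a hard mount this
    # is the call that would hang forever when the server goes away.
    sleep 1
}
export -f do_backup

if timeout --kill-after=30 "${BACKUP_TIMEOUT}" bash -c do_backup; then
    echo "backup finished"
else
    rc=$?                                   # 124 means the deadline was hit
    echo "backup failed or timed out (rc=${rc}); cleaning up" >&2
    # virsh resume "${VM_NAME}"             # resume the paused VM
    # umount -l "${MOUNT_POINT}"            # lazy-unmount the stuck share
    exit "${rc}"
fi
```

`--kill-after=30` escalates to SIGKILL if the child ignores SIGTERM, which matters here because a process stuck in uninterruptible NFS I/O may not react to the first signal.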
Workaround
Manually set mount options on the backup repository:
```sql
UPDATE cloud.backup_repository SET mount_opts='soft,timeo=50,retrans=3' WHERE id=<repo_id>;
```

And update `/etc/fstab` on KVM hosts if the NFS backup share is persistently mounted:

```
172.16.3.63:/ACS /tmp/nasbackup nfs soft,timeo=50,retrans=3,_netdev 0 0
```
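After applying the workaround, it is worth confirming the share really is mounted soft. A small check along these lines — the sample option strings are illustrative; on a real host the string would come from `findmnt -no OPTIONS <mountpoint>`:

```shell
#!/usr/bin/env bash
# Sketch: classify an NFS mount's option string as soft or hard.
# Sample strings below are illustrative, not taken from a real host.
classify_mount() {
    local opts="$1"            # e.g. output of: findmnt -no OPTIONS /tmp/nasbackup
    case ",${opts}," in
        *,soft,*) echo "soft" ;;
        *)        echo "hard" ;;   # hard is the NFS default when unspecified
    esac
}

classify_mount "rw,vers=4.1,soft,timeo=50,retrans=3"   # → soft
classify_mount "rw,vers=4.1,timeo=600,retrans=2"       # → hard
```

Any backup mount classified `hard` here is still exposed to the indefinite-blocking behavior described above.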
Versions
- CloudStack 4.20
- NFS v4.1
- KVM hosts: Debian/Ubuntu with kernel 5.x/6.x