NAS backup: resume paused VM on backup failure and fix missing exit #12822

Open
jmsperu wants to merge 1 commit into apache:main from jmsperu:fix/nas-backup-vm-paused-on-failure

@jmsperu commented Mar 17, 2026

Summary

Fixes #12821: KVM VMs remain indefinitely paused when a NAS backup job fails.

When virsh backup-begin executes a push backup, QEMU pauses the domain for a consistent snapshot. If the backup write fails (e.g. NFS storage full), nasbackup.sh calls cleanup() but:

  1. Never resumes the paused VM: cleanup() only removes files and unmounts
  2. Never exits the monitoring loop: the missing exit after cleanup() in the Failed case causes an infinite loop
  3. Has the same missing exit in backup_stopped_vm(): a qemu-img convert failure calls cleanup() but processing continues

Changes

  • cleanup(): Added VM state detection via virsh domstate and automatic virsh resume if the VM is found paused, ensuring the VM is always resumed during error handling
  • backup_running_vm(): Added exit 1 after cleanup() in the Failed backup job case to terminate the infinite monitoring loop
  • backup_stopped_vm(): Added exit 1 after cleanup() on qemu-img convert failure
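
The two changes can be illustrated with a minimal sketch. This is not the code in nasbackup.sh: resume_if_paused, wait_for_backup and get_job_status are hypothetical names, and the "Completed" status string is an assumption. Only virsh domstate and virsh resume are taken from the description above.

```shell
# Illustrative sketch, not the code in nasbackup.sh.
# resume_if_paused, wait_for_backup and get_job_status are hypothetical names.

# Resume the domain only when libvirt reports it as paused.
resume_if_paused() {
  local vm="$1" state
  # `virsh domstate` prints the state (e.g. "running", "paused") on its first line.
  state="$(virsh domstate "$vm" 2>/dev/null | head -n 1)"
  if [ "$state" = "paused" ]; then
    echo "VM $vm is paused, resuming"
    virsh resume "$vm" >/dev/null
  fi
}

# Poll a backup job until it reaches a final state.
# get_job_status is assumed to print "Completed" or "Failed" once the job ends.
wait_for_backup() {
  local vm="$1"
  while true; do
    case "$(get_job_status "$vm")" in
      Completed)
        return 0
        ;;
      Failed)
        resume_if_paused "$vm"
        # Leaving the loop here plays the role of the added `exit 1`;
        # without it the loop would see the same failed job on every pass.
        return 1
        ;;
    esac
    sleep 1
  done
}
```

The sketch uses return rather than exit so the functions can be sourced and exercised in isolation; in a standalone script the Failed branch would end with exit 1, as the PR does.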

Evidence

In production, NFS backup storage filling to 100% caused 8 VMs to become paused simultaneously across 3 KVM hosts. Some VMs remained paused for over 6 hours. CloudStack UI showed them as "Running" while they were actually paused at the KVM level, requiring manual virsh resume on each host.
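
The manual recovery described above could look roughly like the following on each affected host. This is a sketch only: resume_all_paused is a hypothetical helper name, and it resumes every paused domain on the host, including any that an operator paused on purpose, so the list should be reviewed first.

```shell
# Sketch of a per-host recovery helper; resume_all_paused is a hypothetical name.
# Caution: this resumes every paused domain, including intentionally paused ones.
resume_all_paused() {
  # --state-paused limits the listing to paused domains, --name prints names only.
  virsh list --state-paused --name | while read -r vm; do
    [ -n "$vm" ] || continue          # virsh ends the list with a blank line
    echo "resuming $vm"
    virsh resume "$vm" </dev/null     # keep virsh from consuming the loop's input
  done
}
```

Running virsh list --state-paused --name on its own first shows which domains would be touched.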

Note

The pattern of checking and resuming paused VMs already exists in the Java layer — see LibvirtBackupSnapshotCommandWrapper.java:186-188 and KVMStorageProcessor.java:2268-2272 — but was missing from the shell script that actually manages the backup lifecycle.

Test plan

  • Trigger NAS backup on a running VM with sufficient storage — verify backup completes and VM stays running
  • Trigger NAS backup with NFS storage at 100% — verify backup fails but VM is resumed automatically
  • Trigger NAS backup on a stopped VM with a bad disk path — verify cleanup exits properly
  • Verify cleanup() correctly resumes VM before removing temp files and unmounting
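
For the "VM is resumed automatically" checks, a small polling helper may be handy when running the plan by hand. This is a sketch under assumptions: wait_for_state is a hypothetical name and the default of 10 attempts is arbitrary.

```shell
# Sketch of a test helper; wait_for_state is a hypothetical name.
# Returns 0 once the domain reports the expected state, 1 if it never does.
wait_for_state() {
  local vm="$1" expected="$2" tries="${3:-10}" state
  while [ "$tries" -gt 0 ]; do
    state="$(virsh domstate "$vm" 2>/dev/null | head -n 1)"
    if [ "$state" = "$expected" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

After forcing a backup failure, a call such as wait_for_state my-vm running 30 (with my-vm as a placeholder domain name) would succeed only if the VM left the paused state within about 30 seconds.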

🤖 Generated with Claude Code

When a NAS backup job fails (e.g. due to backup storage being full or
I/O errors), the VM may remain indefinitely paused because:

1. The cleanup() function never checks or resumes the VM's paused state
   that was set by virsh backup-begin during the push backup operation.

2. The 'Failed' case in the backup job monitoring loop calls cleanup()
   but lacks an 'exit' statement, causing an infinite loop where the
   script repeatedly detects the failed job and calls cleanup().

3. Similarly, backup_stopped_vm() calls cleanup() on qemu-img convert
   failure but does not exit, allowing the loop to continue with
   subsequent disks despite the failure.

This fix:
- Adds VM state detection and resume to cleanup(), ensuring the VM is
  always resumed if found in a paused state during error handling
- Adds missing 'exit 1' after cleanup() in the Failed backup job case
  to prevent the infinite monitoring loop
- Adds missing 'exit 1' after cleanup() in backup_stopped_vm() on
  qemu-img convert failure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@DaanHoogland added this to the 4.23.0 milestone Mar 17, 2026
codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 17.94%. Comparing base (93239e0) to head (30a54d0).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12822      +/-   ##
============================================
- Coverage     17.95%   17.94%   -0.01%     
+ Complexity    16259    16258       -1     
============================================
  Files          5954     5954              
  Lines        534838   534838              
  Branches      65423    65423              
============================================
- Hits          96010    95991      -19     
- Misses       428053   428074      +21     
+ Partials      10775    10773       -2     
Flag Coverage Δ
uitests 3.65% <ø> (ø)
unittests 19.06% <ø> (-0.01%) ⬇️

@yadvr requested review from abh1sar and weizhouapache March 17, 2026 11:42

@weizhouapache left a comment

code lgtm

not tested yet

@sureshanaparti

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17178

@sureshanaparti

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

Development

Successfully merging this pull request may close these issues.

KVM NAS backup: VM remains paused indefinitely when backup job fails

5 participants