NAS backup: resume paused VM on backup failure and fix missing exit (#12822)
Conversation
When a NAS backup job fails (e.g. due to backup storage being full or I/O errors), the VM may remain indefinitely paused because:

1. The `cleanup()` function never checks or resumes the VM's paused state that was set by `virsh backup-begin` during the push backup operation.
2. The `Failed` case in the backup job monitoring loop calls `cleanup()` but lacks an `exit` statement, causing an infinite loop where the script repeatedly detects the failed job and calls `cleanup()`.
3. Similarly, `backup_stopped_vm()` calls `cleanup()` on `qemu-img convert` failure but does not exit, allowing the loop to continue with subsequent disks despite the failure.

This fix:

- Adds VM state detection and resume to `cleanup()`, ensuring the VM is always resumed if found in a paused state during error handling
- Adds the missing `exit 1` after `cleanup()` in the `Failed` backup job case to prevent the infinite monitoring loop
- Adds the missing `exit 1` after `cleanup()` in `backup_stopped_vm()` on `qemu-img convert` failure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
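The resume guard added to `cleanup()` can be sketched as follows. This is a minimal illustration rather than the exact `nasbackup.sh` code: the function name `resume_if_paused` is hypothetical, and it assumes `virsh` is on the PATH.

```shell
# Hypothetical sketch of the resume guard added to cleanup():
# if the domain was left paused by `virsh backup-begin`, resume it.
resume_if_paused() {
    local vm_name="$1"
    local state
    state=$(virsh domstate "$vm_name" 2>/dev/null | head -n 1)
    if [ "$state" = "paused" ]; then
        echo "VM $vm_name is paused after backup failure, resuming" >&2
        virsh resume "$vm_name" >/dev/null 2>&1
    fi
}
```

`virsh domstate` prints the domain state (`running`, `paused`, `shut off`, ...) on its first output line, so the guard is safe to call unconditionally, including for VMs that were never paused.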
## Codecov Report

✅ All modified and coverable lines are covered by tests.

```diff
@@             Coverage Diff              @@
##               main   #12822      +/-  ##
============================================
- Coverage     17.95%   17.94%   -0.01%
+ Complexity    16259    16258       -1
============================================
  Files          5954     5954
  Lines        534838   534838
  Branches      65423    65423
============================================
- Hits          96010    95991      -19
- Misses       428053   428074      +21
+ Partials      10775    10773       -2
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
**weizhouapache** left a comment:

code lgtm
not tested yet

@blueorangutan package

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17178

@blueorangutan test

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests.
## Summary

Fixes #12821 — KVM VMs remain indefinitely paused when a NAS backup job fails.

When `virsh backup-begin` executes a push backup, QEMU pauses the domain for a consistent snapshot. If the backup write fails (e.g. NFS storage full), `nasbackup.sh` calls `cleanup()`, but:

- `cleanup()` only removes files and unmounts
- the missing `exit` after `cleanup()` in the `Failed` case causes an infinite loop
- in `backup_stopped_vm()`, a `qemu-img convert` failure calls `cleanup()` but continues processing

## Changes

- `cleanup()`: added VM state detection via `virsh domstate` and an automatic `virsh resume` if the VM is found paused, ensuring the VM is always resumed during error handling
- `backup_running_vm()`: added `exit 1` after `cleanup()` in the `Failed` backup job case to terminate the infinite monitoring loop
- `backup_stopped_vm()`: added `exit 1` after `cleanup()` on `qemu-img convert` failure
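The infinite-loop fix in the monitoring loop has roughly this shape. It is an illustrative sketch, assuming the script polls job state via `virsh domjobinfo`; `monitor_backup_job` and the stub `cleanup` are hypothetical stand-ins for the real script's code.

```shell
cleanup() {
    # Stand-in for the script's real cleanup (unmount, remove temp files,
    # and, with this fix, resume the VM if it is paused).
    :
}

monitor_backup_job() {
    local vm_name="$1" status
    while true; do
        status=$(virsh domjobinfo "$vm_name" --completed 2>/dev/null \
                 | awk '/Job type/ {print $3}')
        case "$status" in
            Completed)
                break ;;
            Failed)
                cleanup
                exit 1 ;;  # previously missing: without it the loop spun forever
        esac
        sleep 5
    done
}
```

Without the `exit 1`, control falls back to the top of the loop, `virsh domjobinfo --completed` keeps reporting the same failed job, and `cleanup()` is invoked over and over.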
## Evidence

In production, NFS backup storage filling to 100% caused 8 VMs to become paused simultaneously across 3 KVM hosts. Some VMs remained paused for over 6 hours. The CloudStack UI showed them as "Running" while they were actually paused at the KVM level, requiring a manual `virsh resume` on each host.
## Note

The pattern of checking and resuming paused VMs already exists in the Java layer (see `LibvirtBackupSnapshotCommandWrapper.java:186-188` and `KVMStorageProcessor.java:2268-2272`) but was missing from the shell script that actually manages the backup lifecycle.

## Test plan
- `cleanup()` correctly resumes the VM before removing temp files and unmounting

🤖 Generated with Claude Code
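The test-plan item could be automated with a polling helper along these lines (a sketch: `wait_for_resume` is a hypothetical name and the default timeout is arbitrary).

```shell
# Hypothetical test helper: after forcing a backup failure, poll until
# the domain is running again, failing if it stays paused past a timeout.
wait_for_resume() {
    local vm_name="$1" timeout="${2:-60}" waited=0 state
    while [ "$waited" -lt "$timeout" ]; do
        state=$(virsh domstate "$vm_name" 2>/dev/null | head -n 1)
        [ "$state" = "running" ] && return 0
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

A smoke test would fill the NFS backup store, trigger a backup, and then assert `wait_for_resume "$vm" 60` succeeds once the failure path runs.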