NAS backup: resume paused VM on backup failure and fix missing exit by jmsperu · Pull Request #12822 · apache/cloudstack

jmsperu · 2026-03-17T03:56:39Z

Summary

Fixes #12821 — KVM VMs remain indefinitely paused when NAS backup job fails.

When virsh backup-begin executes a push backup, QEMU pauses the domain for a consistent snapshot. If the backup write fails (e.g. NFS storage full), nasbackup.sh calls cleanup() but:

Never resumes the paused VM — cleanup() only removes files and unmounts
Never exits the monitoring loop — missing exit after cleanup() in the Failed case causes an infinite loop
Same missing exit in backup_stopped_vm() — qemu-img convert failure calls cleanup() but continues processing
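
The missing exit in the Failed case can be illustrated with a minimal sketch. This is not the code from nasbackup.sh: job_status, cleanup, FAKE_STATUS and POLL_INTERVAL are hypothetical stand-ins introduced here so the loop can run without libvirt.

```shell
#!/bin/bash
# Sketch only: job_status and cleanup are hypothetical stand-ins,
# not the functions from nasbackup.sh.

# Stub: report the state of the backup job. A real script would query libvirt.
job_status() { echo "${FAKE_STATUS:-Completed}"; }

# Stub: release resources after a failure.
cleanup() { echo "cleanup called"; }

# Poll until the job completes; terminate the script if it fails.
monitor_backup_job() {
  while true; do
    case "$(job_status)" in
      Completed)
        break ;;
      Failed)
        echo "backup job failed"
        cleanup
        exit 1 ;;   # without this exit, the loop keeps polling and calling cleanup
    esac
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "backup job completed"
}
```

Removing the `exit 1` line from the sketch reproduces the reported behaviour: a job that stays in the Failed state is detected again on every iteration.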

Changes

cleanup(): Added VM state detection via virsh domstate and automatic virsh resume if the VM is found paused, ensuring the VM is always resumed during error handling
backup_running_vm(): Added exit 1 after cleanup() in the Failed backup job case to terminate the infinite monitoring loop
backup_stopped_vm(): Added exit 1 after cleanup() on qemu-img convert failure
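
As a rough sketch of the cleanup() change, the resume step might look like the following. The helper name resume_if_paused and the use of a VM variable holding the domain name are assumptions made for illustration; the actual patch may structure this differently. Only virsh domstate and virsh resume are taken from the description above.

```shell
#!/bin/bash
# Sketch only: resume_if_paused is a hypothetical helper and VM is assumed
# to hold the libvirt domain name.

# Resume the domain if libvirt reports it as paused; do nothing otherwise.
resume_if_paused() {
  local vm="$1" state
  state=$(virsh -c qemu:///system domstate "$vm" 2>/dev/null)
  if [ "$state" = "paused" ]; then
    echo "VM $vm is paused, resuming"
    virsh -c qemu:///system resume "$vm" >/dev/null 2>&1 \
      || echo "failed to resume VM $vm" >&2
  fi
}

cleanup() {
  # Resume first, so the guest is not left paused if a later step fails.
  resume_if_paused "$VM"
  # ... removal of temporary files and unmounting would follow here ...
}
```

Resuming before the file removal and unmount steps matches the ordering checked in the test plan.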

Evidence

In production, NFS backup storage filling to 100% caused 8 VMs to become paused simultaneously across 3 KVM hosts. Some VMs remained paused for over 6 hours. CloudStack UI showed them as "Running" while they were actually paused at the KVM level, requiring manual virsh resume on each host.
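
For reference, the manual recovery described above could be scripted per host roughly as follows. This is a sketch, not part of the PR: resume_all_paused is a hypothetical function name, and it resumes every paused domain on the host, including any that an operator paused deliberately, so it should be reviewed before use.

```shell
#!/bin/bash
# Sketch only: resume_all_paused is a hypothetical helper, not part of the PR.

# Resume every domain that libvirt currently reports as paused on this host.
resume_all_paused() {
  local vm
  virsh -c qemu:///system list --state-paused --name | while read -r vm; do
    # The listing ends with a blank line; skip empty names.
    if [ -n "$vm" ]; then
      echo "resuming $vm"
      virsh -c qemu:///system resume "$vm" </dev/null
    fi
  done
}
```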

Note

The pattern of checking and resuming paused VMs already exists in the Java layer — see LibvirtBackupSnapshotCommandWrapper.java:186-188 and KVMStorageProcessor.java:2268-2272 — but was missing from the shell script that actually manages the backup lifecycle.

Test plan

Trigger NAS backup on a running VM with sufficient storage — verify backup completes and VM stays running
Trigger NAS backup with NFS storage at 100% — verify backup fails but VM is resumed automatically
Trigger NAS backup on a stopped VM with a bad disk path — verify cleanup exits properly
Verify cleanup() correctly resumes VM before removing temp files and unmounting

🤖 Generated with Claude Code

When a NAS backup job fails (e.g. due to backup storage being full or I/O errors), the VM may remain indefinitely paused because:

1. The cleanup() function never checks or resumes the VM's paused state that was set by virsh backup-begin during the push backup operation.
2. The 'Failed' case in the backup job monitoring loop calls cleanup() but lacks an 'exit' statement, causing an infinite loop where the script repeatedly detects the failed job and calls cleanup().
3. Similarly, backup_stopped_vm() calls cleanup() on qemu-img convert failure but does not exit, allowing the loop to continue with subsequent disks despite the failure.

This fix:

- Adds VM state detection and resume to cleanup(), ensuring the VM is always resumed if found in a paused state during error handling
- Adds missing 'exit 1' after cleanup() in the Failed backup job case to prevent the infinite monitoring loop
- Adds missing 'exit 1' after cleanup() in backup_stopped_vm() on qemu-img convert failure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-03-17T09:18:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 17.94%. Comparing base (93239e0) to head (30a54d0).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #12822      +/-   ##
============================================
- Coverage     17.95%   17.94%   -0.01%     
+ Complexity    16259    16258       -1     
============================================
  Files          5954     5954              
  Lines        534838   534838              
  Branches      65423    65423              
============================================
- Hits          96010    95991      -19     
- Misses       428053   428074      +21     
+ Partials      10775    10773       -2

Flag	Coverage Δ
uitests	`3.65% <ø> (ø)`
unittests	`19.06% <ø> (-0.01%)`	⬇️

weizhouapache

code lgtm

not tested yet

sureshanaparti · 2026-03-17T16:19:08Z

@blueorangutan package

blueorangutan · 2026-03-17T16:20:05Z

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2026-03-17T17:14:00Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17178

sureshanaparti · 2026-03-17T17:51:43Z

@blueorangutan test

blueorangutan · 2026-03-17T17:54:04Z

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

boring-cyborg bot added the component:kvm label Mar 17, 2026

DaanHoogland added this to the 4.23.0 milestone Mar 17, 2026

yadvr requested review from abh1sar and weizhouapache March 17, 2026 11:42

weizhouapache approved these changes Mar 17, 2026

DaanHoogland added the status:needs-testing label Mar 17, 2026

sureshanaparti approved these changes Mar 17, 2026

sureshanaparti added the ready-for-testing label Mar 17, 2026

jmsperu wants to merge 1 commit into apache:main from jmsperu:fix/nas-backup-vm-paused-on-failure

5 participants
