
Introduce new native backup provider (KNIB) #12758

Open

JoaoJandre wants to merge 8 commits into apache:main from scclouds:new-native-backup-provider

Conversation


JoaoJandre commented Mar 6, 2026

Description

This PR adds a new native incremental backup provider for KVM. The design document, which goes into the details of the implementation, can be found at https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622120.

The validation process, which is detailed in the design document, is already included in this PR.
The file extraction process will be added in a later PR.

This PR adds a few new APIs:

  • The createNativeBackupOffering API has the following parameters:

    | Parameter | Description | Default Value | Required |
    |---|---|---|---|
    | name | Specifies the name of the offering | - | Yes |
    | compress | Specifies whether the offering supports backup compression | false | No |
    | validate | Specifies whether the offering supports backup validation | false | No |
    | allowQuickRestore | Specifies whether the offering supports quick restore | false | No |
    | allowExtractFile | Specifies whether the offering supports file extraction from backups | false | No |
    | backupchainsize | Backup chain size for backups created with this offering. If this is set, it overrides the backup.chain.size setting | - | No |
    | compressionlibrary | Specifies the compression library for offerings that support compression. Accepted values are zstd and zlib. By default, zstd is used for images that support it. If the image only supports zlib, it will be used regardless of this parameter. | zstd | No |
  • The deleteNativeBackupOffering API has the following parameter:

    | Parameter | Description | Required |
    |---|---|---|
    | id | Identifier of the native backup offering | Yes |

    A native backup offering can only be removed if it is not currently imported.

  • The listNativeBackupOfferings API has the following parameters:

    | Parameter | Description | Required |
    |---|---|---|
    | id | Identifier of the offering | No |
    | compress | Lists only offerings that support backup compression | No |
    | validate | Lists only offerings that support backup validation | No |
    | allowQuickRestore | Lists only offerings that support quick restore | No |
    | allowExtractFile | Lists only offerings that support file extraction from backups | No |
    | showRemoved | Lists also offerings that have already been removed (default: false) | No |

    By default, lists all offerings that have not been removed.
  • The listBackupServiceJobs API has the following parameters:

    | Parameter | Description |
    |---|---|
    | id | List only the job with the specified ID |
    | backupid | List jobs associated with the specified backup |
    | hostid | List jobs associated with the specified host. When this parameter is provided, the executing parameter is implicit |
    | zoneid | List jobs associated with the specified zone |
    | type | List jobs of the specified type. Accepts StartCompression, FinalizeCompression and BackupValidation |
    | executing | List jobs that are currently executing |
    | scheduled | List jobs scheduled to run in the future |

    By default, lists all jobs that have not been removed.

  • The downloadValidationScreenshot API was added to allow downloading the images generated during the screenshot step. It has the following parameter:

    | Parameter | Description |
    |---|---|
    | backupId | ID of the backup whose screenshot should be downloaded |
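
As an illustration, a native backup offering could be created through CloudStack's regular HTTP API. The sketch below follows CloudStack's documented request-signing scheme (HMAC-SHA1 over the sorted, lower-cased query string); the offering name, API key, and secret are placeholders, and only the parameter names come from the table above:

```python
import base64
import hashlib
import hmac
from urllib.parse import quote, urlencode

def sign_request(params: dict, secret_key: str) -> str:
    """Build a signed CloudStack API query string (HMAC-SHA1 signing scheme)."""
    # Sort parameters by name, URL-encode the values, lower-case the result...
    query = "&".join(
        f"{k.lower()}={quote(str(v), safe='*').lower()}"
        for k, v in sorted(params.items())
    )
    # ...then sign it with the account's secret key.
    digest = hmac.new(secret_key.encode(), query.encode(), hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return urlencode(params) + "&signature=" + quote(signature, safe="")

# Hypothetical offering with compression and validation enabled;
# the API key and secret key are placeholders.
params = {
    "command": "createNativeBackupOffering",
    "name": "gold-knib",
    "compress": "true",
    "validate": "true",
    "compressionlibrary": "zstd",
    "response": "json",
    "apiKey": "PLACEHOLDER-API-KEY",
}
query_string = sign_request(params, "PLACEHOLDER-SECRET")
```

The resulting query string would then be appended to the Management Server's `/client/api` endpoint.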

The PR also adds parameters to the following APIs:

  • The isolated parameter was added to the createBackup and createBackupSchedule APIs
  • The quickRestore parameter was added to the restoreBackup, restoreVolumeFromBackupAndAttachToVM and createVMFromBackup APIs
  • The hostId parameter was added to the restoreBackup and restoreVolumeFromBackupAndAttachToVM APIs; it can only be used by root admins, and only when quickRestore is true.

New settings were also added:

| Configuration | Description | Default Value |
|---|---|---|
| backup.chain.size | Determines the max size of a backup chain. If cloud admins set it to 1, all the backups will be full backups. With values lower than 1, the backup chain will be unlimited, unless it is stopped by another process. Please note that unlimited backup chains have a higher chance of getting corrupted, as new backups will be dependent on all of the older ones. | 8 |
| knib.timeout | Timeout, in seconds, to execute KNIB commands. After the command times out, the Management Server will still wait for another knib.timeout seconds to receive a response from the Agent. | 43200 |
| backup.compression.task.enabled | Determines whether the task responsible for scheduling compression jobs is active. If not, compression jobs will not run. | true |
| backup.compression.max.concurrent.operations.per.host | Maximum number of concurrent compression jobs per host. Compression finalization jobs ignore this setting. | 5 |
| backup.compression.max.concurrent.operations | Maximum number of compression jobs that can be executed at the same time in the zone. Values lower than 1 disable the limit. | 0 |
| backup.compression.max.job.retries | Maximum number of attempts for executing compression jobs | 2 |
| backup.compression.retry.interval | Interval, in minutes, between attempts to run compression jobs | 60 |
| backup.compression.timeout | Timeout, in seconds, for running compression jobs | 28800 |
| backup.compression.minimum.free.storage | Minimum required available storage to start the backup compression process. This setting accepts a real number that is multiplied by the total size of the backup to determine the necessary available space. By default, the storage must have the same amount of available space as the space occupied by the backup. | 1 |
| backup.compression.coroutines | Number of coroutines used for the compression process; each coroutine has its own thread | 1 |
| backup.compression.rate.limit | Compression rate limit, in MB/s. Values less than 1 disable the limit. | 0 |
| backup.validation.task.enabled | Determines whether the task responsible for scheduling validation jobs is active. If it is not active, validation jobs will not run. | true |
| backup.validation.interval | Interval, in hours, between two validations of the same backup | 24 |
| backup.validation.max.concurrent.operations | Maximum number of validation jobs that can be executed at the same time in the zone. Values lower than 1 disable the limit. | 0 |
| backup.validation.max.concurrent.operations.per.host | Maximum number of validation jobs that can be executed at the same time on each host. Values lower than 1 disable the limit. | 1 |
| backup.validation.boot.default.timeout | Default timeout, in seconds, for the boot validation step | 240 |
| backup.validation.script.default.timeout | Default timeout, in seconds, for the script validation step | 60 |
| backup.validation.screenshot.default.wait | Default waiting time, in seconds, before executing the VM screenshot | 60 |
| backup.validation.end.chain.on.fail | If true, ends the current backup chain if backup validation fails and the backup belongs to that chain. | true |
| enforce.resource.limit.on.backup.validation.vm | If true, the creation of validation VMs is bound by the account/domain resource limits. | false |
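
For instance, the backup.compression.minimum.free.storage check described above can be sketched as follows; this illustrates the documented semantics and is not the actual implementation:

```python
def has_enough_space(backup_size_bytes: int, free_bytes: int, factor: float = 1.0) -> bool:
    """backup.compression.minimum.free.storage semantics: compression only starts
    when the storage has at least `factor` times the backup's size available."""
    return free_bytes >= backup_size_bytes * factor

GiB = 1024 ** 3
# With the default factor of 1, a 40 GiB backup needs 40 GiB of free space.
print(has_enough_space(40 * GiB, 39 * GiB))       # False: not enough free space
print(has_enough_space(40 * GiB, 45 * GiB))       # True
print(has_enough_space(40 * GiB, 25 * GiB, 0.5))  # True: factor 0.5 halves the requirement
```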

Two new host-level configurations, configured in agent.properties, were added: backup.validation.max.concurrent.operations.per.host and backup.compression.max.concurrent.operations.per.host. When configured, they override the values set in the global configurations with the same names.

Six new VM configurations were added:

| Configuration | Description |
|---|---|
| backupValidationCommand | Command to be executed during the command execution step. The step will not run if this configuration does not have a value |
| backupValidationCommandArguments | Arguments to be passed to the command executed during the command execution step |
| backupValidationCommandExpectedResult | Expected result of the command execution, encoded in Base64. If not provided, the process exit code will be checked for value 0 |
| backupValidationCommandTimeout | Timeout for the command executed during the command execution step. Overrides the global configuration |
| backupValidationScreenshotWait | Waiting time before executing the screenshot during the screenshot step. Overrides the global configuration |
| backupValidationBootTimeout | Timeout for the VM boot. Overrides the global configuration |

By default, the VM configurations backupValidationCommandTimeout, backupValidationScreenshotWait, and backupValidationBootTimeout are read-only for regular users. This behavior can be changed through the global configuration user.vm.readonly.details.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tests related to disk-only VM snapshots

| N | Test | Result |
|---|---|---|
| 1 | Take disk-only VM snapshot | ok |
| 2 | Take disk-only VM snapshot again | ok |
| 3 | Stop VM, revert to snapshot 2, start VM | Correct deltas found in the VM volume chain |
| 4 | Stop VM, revert to snapshot 1, start VM | Correct deltas found in the VM volume chain |
| 5 | Take disk-only VM snapshot | ok |
| 6 | Remove disk-only VM snapshot 1 | Marked as removed, not removed from storage |
| 7 | Remove disk-only VM snapshot 3 | Merged with current volume |
| 8 | Remove disk-only VM snapshot 2 | Removed, snap 1 merged with the current volume |

Basic tests with backup

Using backup.chain.size=3

| N | Test | Result |
|---|---|---|
| 1 | With the VM stopped, I created a backup (b1) | Full backup created |
| 2 | Started VM, wrote data, created a second backup (b2) | Incremental backup created |
| 3 | Stopped the VM, went back to backup 1, started | Ok, VM without data |
| 4 | Stopped the VM, went back to backup 2, started | Ok, data from the backup created in test 2 present |
| 5 | Created 4 backups (b3, b4, b5, b6) | b3 was a full, b4 and b5 incremental, b6 full |
| 6 | Removed the last backup (b6) | The delta on the primary was merged with the volume |
| 7 | Removed backups b4 and b5 | They were marked as removed, but not deleted from storage |
| 8 | Batch removed the remaining backups | Ok, all removed |
| 9 | Created a new backup | ok |
| 10 | Detached the VM from the offering | Deltas were merged on primary |
| 11 | Removed this last backup | ok |
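
The chain behavior exercised above (with backup.chain.size=3) can be sketched as follows; this illustrates the documented semantics and is not the actual CloudStack code:

```python
def next_backup_type(chain_length: int, chain_size: int = 8) -> str:
    """Sketch of the backup.chain.size behavior: once a chain holds chain_size
    backups, the next backup starts a new chain as a full backup. A chain_size
    of 1 makes every backup full; values below 1 leave the chain unbounded."""
    if chain_size == 1:
        return "full"
    if chain_size < 1:
        return "full" if chain_length == 0 else "incremental"
    return "full" if chain_length % chain_size == 0 else "incremental"

# With backup.chain.size=3, as in the tests above: one full backup,
# two incrementals, then a new full backup (b3 full, b4/b5 incremental, b6 full).
print([next_backup_type(n, chain_size=3) for n in range(4)])
# ['full', 'incremental', 'incremental', 'full']
```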

Interactions with other functionalities

I created a new VM with a root disk and a data disk for the tests below.

| N | Test | Result |
|---|---|---|
| 1 | Took a new backup and migrated the VM | ok |
| 2 | Migrated the VM + one of the volumes | ok, the migrated volume had no delta on the primary, the other volume still had a delta |
| 3 | Took a new backup | For the volume that was not migrated the backup was incremental; for the migrated volume, it was a full backup |
| 4 | Took 2 backups | OK, they finished normally |
| 5 | Tried restoring one of the backups from before the migration | OK |
| 6 | Created file 1, created backup b1 | OK |
| 7 | Created file 2, created VM snap s1 | OK |
| 8 | Created file 3, created VM snap s2 | OK |
| 9 | Created file 4, created backup b2 | OK |
| 10 | Created file 5, created backup b3 | OK |
| 11 | Stopped the VM, restored VM snap s1, started | Files 1 and 2 present |
| 12 | Stopped the VM, restored VM snap s2, started | Files 1, 2 and 3 present |
| 13 | Removed VM snapshots | ok |
| 14 | Restored backup b1, started | File 1 present |
| 15 | Restored backup b2, started | Files 1, 2, 3 and 4 present |
| 16 | Restored backup b3, started | Files 1, 2, 3, 4 and 5 present |
| 17 | Took a new backup b4 | ok |
| 18 | Attached a new volume, wrote data, took a backup b5 | ok |
| 19 | Stopped the VM, restored backup b4, started the VM | The new volume was not affected by the restoration |
| 20 | Detached the volume, restored backup b5 | A new volume was created and attached to the VM, the files were there |
| 21 | Created a backup | ok |
| 22 | Created a volume snapshot | OK |
| 23 | Reverted the volume snapshot | I verified that the delta on the primary left by the last backup was removed |

Configuration Tests

  • I changed the value of the backup.compression.task.enabled setting and verified that no new jobs were started. I verified that, when it was returned to true, the jobs were executed.
  • I changed the value of the backup.compression.max.concurrent.operations.per.host setting and verified that the number of jobs executed simultaneously for each host was relative to the value of the setting. I also verified that the value -1 does not limit the number of jobs executed by the host.
  • I verified that the number of retries respects the backup.compression.max.job.retries setting.
  • I verified that the time between retries respects the backup.compression.retry.interval setting.
  • I changed the value of the backup.compression.minimum.free.storage setting and verified that the job failed if there was not enough free space.
  • I changed the value of the backup.compression.coroutines setting and verified that the value passed to qemu-img was reflected.
  • I changed the value of the backup.compression.rate.limit setting and verified that the value was passed to qemu-img.

Compression Tests

Tests performed with an offering that supports backup compression

| Test | Result |
|---|---|
| Create full backup | Backup created and compressed |
| Create incremental backup | Backup created and compressed |
| Create 10 backups of the same machine | Backups created sequentially, but compressed in parallel |

Validation tests

| Test Case | Result |
|---|---|
| Attach a backup offering with screenshot validation and perform a backup | Backup completed successfully |
| Attach a backup offering with wait-for-boot validation and perform a backup | Boot was detected and backup completed successfully |
| Attach a backup offering with wait-for-boot and command execution validation, but do not specify a command, then perform a backup | Backup completed successfully, but no command was executed |
| Attach a backup offering with wait-for-boot and command execution validation, configuring the command echo | Backup completed successfully |
| Attach a backup offering with wait-for-boot and command execution validation, configuring the command echo with argument ababa | Backup completed successfully |
| Attach a backup offering with wait-for-boot and command execution validation, configuring the command echo, argument ababa, and expected output absdui | Backup failed, expected output YWJhYmEK |
| Attach a backup offering with wait-for-boot and command execution validation, configuring the command echo, argument ababa, and expected output YWJhYmEK | Backup completed successfully |
| Attach a backup offering with screenshot, wait-for-boot, and command execution validation, configuring the command echo, argument ababa, and expected output YWJhYmEK | Backup completed successfully |
| Download a validation screenshot | URL generated and download successful |
| Attempt to download a screenshot from a backup without a screenshot | URL was not generated |
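
The expected-output comparison in the tests above is done in Base64, which is why the failed test reports YWJhYmEK: it is the Base64 encoding of the output of `echo ababa`, including the trailing newline. A quick check:

```python
import base64

# "ababa" plus echo's trailing newline, Base64-encoded.
expected = base64.b64encode(b"ababa\n").decode()
print(expected)  # YWJhYmEK
```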

Tests with restoreVolumeFromBackupAndAttachToVM

| N | Test | Result |
|---|---|---|
| 1 | Restore volume A from VM 1 to the same VM while it is stopped | New volume created, restored, and attached |
| 2 | Restore volume B from VM 1 to the same VM while it is running | New volume created, restored, and attached |
| 3 | Restore volume B from VM 1 to VM 2 while it is running | New volume created, restored, and attached |
| 4 | Restore volume A from VM 1 to VM 2 while it is running, even though the VM was deleted | New volume created, restored, and attached |
| 5 | Restore a volume to a VM using quickrestore | New volume created, attached, and consolidated with the backup |
| 6 | Restore a volume to a stopped VM using quickrestore and specifying the hostId | New volume created, attached, VM started on the specified host, and volume consolidated |

Tests with restoreBackup

| N | Test | Result |
|---|---|---|
| 1 | Restore VM without quickrestore | OK |
| 2 | Restore VM with quickrestore | Volumes restored, VM started, and volumes consolidated |
| 3 | Restore VM with quickrestore and specifying hostId | Volumes restored, VM started on the specified host, and volumes consolidated |
| 4 | Detach a volume from the VM and repeat test 3 | Detached volume duplicated, attached to the VM, restored, VM started on the host, and volumes consolidated |

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 5.59331% with 3612 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.90%. Comparing base (b744824) to head (fe64ef4).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/wrapper/LibvirtTakeKnibBackupCommandWrapper.java | 0.49% | 202 Missing ⚠️ |
| ...ervisor/kvm/resource/LibvirtComputingResource.java | 4.36% | 197 Missing ⚠️ |
| ...che/cloudstack/backup/NativeBackupServiceImpl.java | 0.00% | 166 Missing ⚠️ |
| ...stack/backup/NativeBackupServiceJobController.java | 0.00% | 148 Missing ⚠️ |
| ...e/wrapper/LibvirtValidateKnibVmCommandWrapper.java | 0.76% | 130 Missing ⚠️ |
| .../backup/BackupCompressionServiceJobController.java | 0.00% | 122 Missing ⚠️ |
| .../src/main/java/com/cloud/vm/UserVmManagerImpl.java | 12.90% | 105 Missing and 3 partials ⚠️ |
| ...tack/backup/BackupValidationServiceController.java | 0.00% | 102 Missing ⚠️ |
| ...apache/cloudstack/storage/backup/BackupObject.java | 0.00% | 91 Missing ⚠️ |
| ...napshot/KvmFileBasedStorageVmSnapshotStrategy.java | 0.00% | 82 Missing ⚠️ |

... and 127 more
Additional details and impacted files
```
@@             Coverage Diff              @@
##               main   #12758      +/-   ##
============================================
- Coverage     18.02%   17.90%   -0.12%
- Complexity    16450    16501      +51
============================================
  Files          5968     6032      +64
  Lines        537086   542119    +5033
  Branches      65961    66422     +461
============================================
+ Hits          96819    97078     +259
- Misses       429347   434090    +4743
- Partials      10920    10951      +31
```

| Flag | Coverage Δ |
|---|---|
| uitests | 3.53% <ø> (-0.01%) ⬇️ |
| unittests | 19.05% <5.59%> (-0.14%) ⬇️ |


yadvr commented Mar 8, 2026

@JoaoJandre just a heads up - my colleagues have been working on an incremental backup feature for NAS B&R (using nbd/qemu bitmap tracking & checkpoints). We're also working on a new Veeam-KVM integration for CloudStack whose PR may be out soon. My colleagues can further help review and advise on this.

/cc @weizhouapache @abh1sar @shwstppr @sureshanaparti @DaanHoogland @harikrishna-patnala

Just my 2cents on the design & your comments - NAS is more than just NFS, but any (mountable) shared storage such as CephFS, cifs/samba etc. Enterprise users usually don't want to mix using secondary storage with backup repositories, which is why NAS B&R introduced a backup-provider agnostic concept of backup repositories which can be explored by other backup providers.

JoaoJandre (author) replied:

Just my 2cents on the design & your comments - NAS is more than just NFS, but any (mountable) shared storage such as CephFS, cifs/samba etc.

At the time of writing that part, I believe it was only NFS that was supported. I'll update the relevant part.

Enterprise users usually don't want to mix using secondary storage with backup repositories, which is why NAS B&R introduced a backup-provider agnostic concept of backup repositories which can be explored by other backup providers.

The secondary storage selector feature (introduced in 2023 by #7659) allows you to specialize secondary storages. This PR extends the feature so that you may also create selectors for backups.

abh1sar commented Mar 9, 2026

Hi Joao,

This looks promising. Incremental backups, quick restore and file restore features have been missing from CloudStack KVM.

I am having trouble understanding some of the design choices though:

  1. What’s the reason behind strong coupling with secondary storage?

    • I am wondering if the Backup Repository would provide a more flexible alternative. The user would be free to add an external storage server, or to use the secondary storage by simply adding it as a backup repository. It would be very easy for the user to have multiple backup repositories attached to multiple backup offerings, which can be assigned to instances as required.

      This will also be consistent with other backup providers like Veeam and NAS which have the concept of backup repository.

      The backup repository feature also comes with a separate capacity tracking and email alerts.

    • If a secondary storage is needed just for backup’s purpose, how will it be ensured that templates and snapshots are not copied over to it?

  2. About Qemu compression

    • Have you measured / compared the performance of qemu-img compression with other compression methods?
    • As I understand, qemu-img compresses the qcow2 file at a cluster granularity (usually 64kb). That might not fare well when compared to storage level compression. In production environments, the operator might choose to have compression at the storage layer if they are using an enterprise storage like NetApp. Even something open source like ZFS might perform better than qemu-img compress due to the granularity limitation that qemu compression has.
    • I am making this point because the compression part is introducing a fair bit of complexity due to the interaction with SSVM, and I am just wondering if the gains are worth the trouble and should compression be offloaded to the storage completely.
  3. Do we need a separate backup offering table and api?

    • Why not add column or details to backup_offering or backup_offering_details? Other offerings can also benefit from these settings.
  4. What’s the reason behind using virDomainSnapshotCreate to create backup files and not virDomainBackupBegin like incremental volume snapshots and NAS backup?

    • Did you face any issues with checkpoints and bitmaps?

JoaoJandre (author) replied:

Hi Joao,

Hello, @abh1sar

This looks promising. Incremental backups, quick restore and file restore features have been missing from CloudStack KVM.

I am having trouble understanding some of the design choices though:

1. What’s the reason behind strong coupling with secondary storage?
   
   * I am wondering if the Backup Repository would provide a more flexible alternative. The user would be free to add an external storage server, or to use the secondary storage by simply adding it as a backup repository. It would be very easy for the user to have multiple backup repositories attached to multiple backup offerings, which can be assigned to instances as required.

I don't see why we should force the coupling of backup offerings with backup repositories, what is the benefit?

     This will also be consistent with other backup providers like Veeam and NAS which have the concept of backup repository.
     The backup repository feature also comes with a separate capacity tracking and email alerts.

The secondary storage also has both features, although the capacity is not currently reported to users.

   * If a secondary storage is needed just for backup’s purpose, how will it be ensured that templates and snapshots are not copied over to it?

The secondary storage selectors feature (introduced in 2023 through #7659) allows you to specialize secondary storages. Quoting from the PR description: "This PR aims to add the possibility to direct resources (Volumes, Templates, Snapshots and ISOs) to a specific secondary storage through rules written in JavaScript that will only affect new allocated resources". For a few years it has been possible to have secondary storages that only receive snapshots or templates for example. This PR introduces the possibility to add selectors for backups, so that you have secondary storages that are specific for backups.

Furthermore, my colleagues are working on a feature to allow using alternative secondary storage solutions, such as CephFS, iSCSI and S3, while preserving compatibility with features designed for NFS storage. This feature may be extended in the future to allow essentially any type of secondary storage. Thus, the flexibility of secondary storages will soon grow.

2. About Qemu compression
   
   * Have you measured / compared the performance of qemu-img compression with other compression methods?

Using any other type of backup-level compression would be worse than using qemu-img compression. This is because, when restoring a backup, we must have access to the whole backing chain. If we use other types of compression, we have to decompress the whole chain before restoring. With qemu-img, the backing files are still valid and never need to be decompressed at all. This is the great benefit of using qemu-img.

In any case, here is a brief comparison of using qemu-img with the zstd library and 8 threads and using the pigz implementation of multi-threaded compression, also using 8 threads. The original file is the root volume of a VM that I use.

| Command | Time | Original file size | Final file size |
|---|---|---|---|
| qemu-img convert -c -p -W -m 8 -f qcow2 -O qcow2 -o compression_type=zstd | real 3m51.944s - user 16m11.970s - sys 4m14.987s | 43G | 35G |
| pigz -p8 | real 6m13.799s - user 44m33.300s - sys 1m54.801s | 43G | 34G |
| pigz --zip -p8 | real 6m2.729s - user 44m38.401s - sys 1m47.663s | 43G | 34G |

Compression using qemu-img was a lot faster, with a slightly lower compression ratio. Furthermore, we have to consider that the qemu-img compressed image can be used as-is, while the other images must be decompressed, further adding to the processing time of backing up/restoring a backup.

   * As I understand, qemu-img compresses the qcow2 file at a cluster granularity (usually 64kb). That might not fare well when compared to storage level compression. In production environments, the operator might choose to have compression at the storage layer if they are using an enterprise storage like NetApp. Even something open source like ZFS might perform better than qemu-img compress due to the granularity limitation that qemu compression has.

The compression feature is optional, if you are using storage-level compression, you probably will not use backup-level compression. However, many environments do not have storage-level compression, thus having the possibility of backup-level compression is still very interesting.

   * I am making this point because the compression part is introducing a fair bit of complexity due to the interaction with SSVM, and I am just wondering if the gains are worth the trouble and should compression be offloaded to the storage completely.

The compression does not add any interaction with the SSVM.

3. Do we need a separate backup offering table and api?
   
   * Why not add column or details to backup_offering or backup_offering_details? Other offerings can also benefit from these settings.

I did not want to add dozens of parameters to the import backup offering API which are only really going to be used for one provider. This way, the original design of the API is preserved.

Furthermore, you may note that the APIs are intentionally not called createKnibBackupOffering, but createNativeBackupOffering. If other native providers want to use these offerings, they may do so by extending their implementations.

4. What’s the reason behind using virDomainSnapshotCreate to create backup files and not virDomainBackupBegin like incremental volume snapshots and NAS backup?
   
   * Did you face any issues with checkpoints and bitmaps?

There are two main issues with using bitmaps:

  1. They are prone to corruption. While this can be mitigated in some ways, since the incremental volume snapshot feature was added we have noticed multiple cases of bitmap corruption with different causes. It is possible to detect the corruption and delete corrupt bitmaps, but this would add more complexity to the feature.
  2. Using bitmaps is not compatible with the file-based incremental VM snapshot feature added in #10632 (file-based disk-only VM snapshot with KVM as hypervisor). After some internal discussion and feedback from users, we have come to the conclusion that being able to use both the incremental VM snapshot and backup features at the same time is very interesting.

At the end of the day, this PR adds a new backup provider option for users. They will be free to choose the provider that best fits their needs. This is one of the reasons why it was done as a new backup provider; KNIB and other backup providers do not have to cancel each other out.

abh1sar commented Mar 13, 2026

Hi @JoaoJandre
Please find my response inline.

I don't see why we should force the coupling of backup offerings with backup repositories, what is the benefit?

It has a big benefit for use cases where someone wants multiple tiers of backup storage with different cost and recovery characteristics. For example, long-term archival backups might go to cheap cold storage like tape, while backups that require faster recovery (better RTO) may use more expensive SSD-backed storage.

You can have multiple backup offerings for different use cases and VMs can be attached to the required offerings.
Furthermore, the backup storage usage is tracked per Backup Offering. So, users can use different offerings as per requirement and get charged accordingly.

The secondary storage also has both features. Although the capacity is not reported to the users currently.

Capacity is just for Admins for Backup Repository also, so that is not the issue.
But the Secondary Storage capacity tracking and alerts would be mixed up for Templates, Snapshots, Volumes and Backups.
With backup repositories, you will get separate capacity tracking and alerts just for the Backup Infrastructure.

Compression using qemu-img was a lot faster, with a bit smaller compression ratio. Furthermore, we have to consider that the qemu-img compressed image can be used as-is, while the other images must be decompressed, further adding to the processing time of backing up/restoring a backup.

It does have its benefits, but do keep in mind that the decompression cost will be paid by Reads. Also if someone is using a restored VM, they might see the physical size increase disproportionately to the actual data written due to decompression. It's still a useful feature and it's good that it is optional.

I did not want to add dozens of parameters to the import backup offering API which are only really going to be used for one provider. This way, the original design of the API is preserved.

If there is a possibility that these parameters can be added for other providers, they should be added to the existing API. Some of these, like allowQuickRestore and allowExtractFile, might as well be added for other providers such as Veeam and NAS in the future.

Furthermore, you may note that the APIs are intentionally not called createKnibBackupOffering, but createNativeBackupOffering. If other native providers want to use these offerings, they may do so by extending their implementations.

  • What defines a provider as Native?
  • Do the native offerings differ in functionality or usage to the existing BackupOfferings?
  • What will this look like in the UI? Will the user see a Backup Offering and another Native Backup Offering?
  • Again, coupling an offering to storage has its benefits which current Backup Offerings already have.

JoaoJandre (author) replied:

Hello, @abh1sar

It has a big benefit for use cases where someone wants multiple tiers of backup storage with different cost and recovery characteristics. For example, long-term archival backups might go to cheap cold storage like tape, while backups that require faster recovery (better RTO) may use more expensive SSD-backed storage.

You can have multiple backup offerings for different use cases and VMs can be attached to the required offerings. Furthermore, the backup storage usage is tracked per Backup Offering. So, users can use different offerings as per requirement and get charged accordingly.

You can have exactly that using this implementation. Essentially, backup repositories and secondary storages are the same, with different names. Using selectors, you can have offerings that are tied to specific secondary storages, or you can have offerings that go into multiple storages; it was made to be flexible.

Capacity is just for Admins for Backup Repository also, so that is not the issue. But the Secondary Storage capacity tracking and alerts would be mixed up for Templates, Snapshots, Volumes and Backups. With backup repositories, you will get separate capacity tracking and alerts just for the Backup Infrastructure.

Again, you can have dedicated secondary storages using selectors.

If there is a possibility that these parameters can be added for other providers, they should be added to the existing API. Some of these, like allowQuickRestore and allowExtractFile, might as well be added for other providers such as Veeam and NAS in the future.

While I was not the one who introduced the backup framework, looking at its design, it was clearly intended to have as little configuration as possible on the ACS side while leaving these details to the backup provider. If we add these parameters to the import backup offering API, I'm sure a lot of users will be confused when these parameters do nothing for Veeam and Dell Networker. I did not intend to warp the original design of configuring the offerings on the provider side and only importing them to ACS.

This is why I created the concept of native offerings. As with KNIB (and NAS) the provider is ACS, and thus the configurations can still be made on the "backup provider", and the import API will follow the same logic it always had.

* What defines a provider as `Native`?

A provider is native if the backup implementation is made by ACS. Thus, the current native providers would be NAS and KNIB. Veeam and Networker are external providers.

* Do the native offerings differ in functionality or usage to the existing BackupOfferings?

They are used to configure the details that would be configured in the backup provider if it was an external provider, such as Veeam for example.

* What will this look like in the UI? Will the user see `Backup Offering` and another `Native Backup Offering`?

I have yet to add it to the GUI; if you have any suggestions for the GUI design, you are free to give your opinion. The GUI for the native backup offerings will be added in the future.

* Again, coupling an offering to storage has its benefits which current Backup Offerings already have.

Again, you absolutely can do this with KNIB. But you may also choose not to do it; it's flexible.

@weizhouapache
Member

@JoaoJandre
Although I’m not fully convinced that the mixed use of Secondary Storage is a good approach, I appreciate the work you’ve put into this. It would probably have been better to discuss the idea with the community earlier in the process, rather than presenting the design document and PR after most of the development had already been completed.

Would it be possible to change the name Native to something clearer? For example, "Secondary Storage Backup" might be more descriptive and easier for end users to understand. The term "Native" is not commonly used in CloudStack.

@github-actions

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@JoaoJandre
Contributor Author

Would it be possible to change the name Native to something clearer? For example, "Secondary Storage Backup" might be more descriptive and easier for end users to understand. The term "Native" is not commonly used in CloudStack.

I don't see any issue with "Native", and I'm not claiming that KNIB is the only "Native" provider. As I explained before:

  • Native = CloudStack is the backup provider, implementing all of the processes necessary for the backup to be executed.
  • External = CloudStack integrates an external solution that is responsible for the backups, such as Veeam and Dell Networker.

This PR does not remove or substitute anything, the provider I am proposing is a new option for users to use.

@JoaoJandre
Contributor Author

Also, the validation was added in the last commits. All the features that were meant to be included in this PR are already here.

@weizhouapache
Member

Would it be possible to change the name Native to something clearer? For example, "Secondary Storage Backup" might be more descriptive and easier for end users to understand. The term "Native" is not commonly used in CloudStack.

I don't see any issue with "Native", and I'm not claiming that KNIB is the only "Native" provider. As I explained before:

  • Native = CloudStack is the backup provider, implementing all of the processes necessary for the backup to be executed.
  • External = CloudStack integrates an external solution that is responsible for the backups, such as Veeam and Dell Networker.

This PR does not remove or substitute anything, the provider I am proposing is a new option for users to use.

We do not use the term "Native" in CloudStack. Plugin names are typically based on the name of the external device or backup repository they integrate with.

Please follow the same convention.

@bernardodemarco
Member

Hello, @weizhouapache, all

We do not use the term "Native" in CloudStack

Yes, we currently do not have any APIs using the term Native. But this does not imply or require us to avoid using this expression when it is suitable.

Plugin names are typically based on the name of the external device or backup repository they integrate with.

Yes, that is correct. In the case of the new backup plugin, it is fully implemented by ACS, including backup creation, chain management, compression, validation and other related processes. Therefore, its name is fully coherent with its functionality. In other words, KNIB (KVM Native Incremental Backup) stands for:

  • KVM: provides support for the KVM hypervisor;
  • Native: Fully native to Apache CloudStack, i.e., all processes are implemented by the cloud platform orchestrator itself;
  • Incremental Backup: Incremental backups are supported.

Please follow the same convention.

Sorry, but is this an established convention? I am not aware of any guideline stating that the term Native is not supported.

Again, if there is any semantic inconsistency with the plugin name KNIB (KVM Native Incremental Backup), we can address it. However, the plugin supports incremental backups, for KVM, natively in ACS; thus, the name accurately reflects what it delivers. It does not promote any concept or capability that is not actually provided.


@JoaoJandre, btw, thanks for all the effort you put into designing and implementing this backup plugin. It will certainly be a stable and efficient solution, with strong adoption across cloud providers and users.

@weizhouapache
Member

Hello, @weizhouapache, all

We do not use the term "Native" in CloudStack

Yes, we currently do not have any APIs using the term Native. But this does not imply or require us to avoid using this expression when it is suitable.

Plugin names are typically based on the name of the external device or backup repository they integrate with.

Yes, that is correct. In the case of the new backup plugin, it is fully implemented by ACS, including backup creation, chain management, compression, validation and other related processes. Therefore, its name is fully coherent with its functionality. In other words, KNIB (KVM Native Incremental Backup) stands for:

  • KVM: provides support for the KVM hypervisor;
  • Native: Fully native to Apache CloudStack, i.e., all processes are implemented by the cloud platform orchestrator itself;
  • Incremental Backup: Incremental backups are supported.

Please follow the same convention.

Sorry, but is this an established convention? I am not aware of any guideline stating that the term Native is not supported.

Again, if there is any semantic inconsistency with the plugin name KNIB (KVM Native Incremental Backup), we can address it. However, the plugin supports incremental backups, for KVM, natively in ACS; thus, the name accurately reflects what it delivers. It does not promote any concept or capability that is not actually provided.

@JoaoJandre, btw, thanks for all the effort you put into designing and implementing this backup plugin. It will certainly be a stable and efficient solution, with strong adoption across cloud providers and users.

I don’t want to continue debating this topic, so I’ll just summarize my view:

  • The mixed use of Secondary Storage is not a good approach.
  • The term "Native" is unclear and may confuse users. It would be better to follow the existing naming conventions used for other network/storage/backup plugins and choose a more descriptive name.
  • Please move API/DB/Java classes into the plugin as much as possible to keep the implementation isolated.

@DaanHoogland
Contributor

@JoaoJandre @bernardodemarco , I have not looked at the code yet, but am trusting that it will be of the usual quality we are grown to expect from you guys.
I do agree with @weizhouapache that “native” is not the best choice of names. Something indicating the “external” platform, which is internal in this case, would be better. So maybe “internal” or, as Wei suggests, “secondary storage”, “ss” or “secstor”. My main reason is that other backup solutions all have a native part as well.
About the repository discussion: is it possible to name the particular secondary storages as repositories? The advantage would be that the backup framework remains of universal shape, and I don't think it has to impede any of the secstor backup functionality.

@abh1sar
Contributor

abh1sar commented Mar 17, 2026

I have yet to add it to the GUI, if you have any suggestions for the GUI design, you are free to give your opinion. The GUI for the native backup offerings will be added in the future.

I would definitely not want two kinds of backup offerings displayed separately in the UI. The way this could be approached is to have a unified view of the two offerings, with a type field differentiating between them. The create offering form fields can be populated depending on the type chosen.
But if we are having a unified view in the UI, why have separate APIs at all? The backup offering API itself can have a type field and all the extra fields which are required.

A provider is native if the backup implementation is made by ACS. Thus, the current native providers would be NAS and KNIB. Veeam and Networker are external providers.

But NAS doesn't use native backup offerings, so this is still confusing.

They are used to configure the details that would be configured in the backup provider if it was an external provider, such as Veeam for example.

I don't think we are gaining much from defining a new native backup offering api. I highly recommend having a single Backup offering API and database tables.

@JoaoJandre
Contributor Author

Hello, @abh1sar, @DaanHoogland and @weizhouapache

* The mixed use of Secondary Storage is not a good approach.

I have already explained why the use of secondary storage is a good approach as a target for backups. But I can give you even more points as to why it is a good design:

  • It integrates the features ACS has had for many years. At the end of the day, a backup repository is a storage space introduced for the backups (in the NAS backup provider), which is basically the same as the secondary storage; the only difference is the API/method in which this information is presented to ACS. Moreover, I have implemented secondary storage migration for backups, which helps administrators with backup repository maintenance. Also, if the backup repository goes down for any reason, it is possible to temporarily direct backups to any other secondary storage and then switch back to the proper repository when needed. Furthermore, we can have dedicated secondary storage acting as backup repositories for different accounts, domains, projects, and so on. All of these are achieved with native concepts in ACS, which provide more flexibility for operators to handle their daily tasks and unexpected situations.
  • The validation screenshots can be downloaded; since the SSVM is already responsible for allowing the download of objects from inside the system, the screenshots would have to be passed to the secondary storage anyway to implement this workflow, for instance.
  • The secondary storage browser feature allows operators to visualize the backup files through the GUI.
  • It will be compatible with the alternative secondary storage protocol feature that is being developed, which I mentioned here Introduce new native backup provider (KNIB) #12758 (comment).

Regarding the naming, KNIB is a native process, meaning ACS does all of the heavy lifting in the backup processing. This differs from Veeam and other providers, where the heavy lifting is done by the provider itself and ACS is only a facade to let users assign/remove backup offerings (jobs in the backend of Veeam, for instance).

* The term "Native" is unclear and may confuse users. It would be better to follow the existing naming conventions used for other network/storage/backup plugins and choose a more descriptive name.

Using the term "Native" does not go against any guideline in the official coding conventions. And, if by convention you mean what is already in the code: if we only follow what already exists, nothing new will ever be introduced.

* Please move API/DB/Java classes into the plugin as much as possible to keep the implementation isolated.

This was done from the beginning, but if you review and have a specific part of the code that you think could be more isolated, please share.

I would definitely not want two kinds of backup offerings displayed separately in the UI. The way this could be approached is to have a unified view of the two offerings, with a type field differentiating between them. The create offering form fields can be populated depending on the type chosen. But if we are having a unified view in the UI, why have separate APIs at all? The backup offering API itself can have a type field and all the extra fields which are required.
I don't think we are gaining much from defining a new native backup offering api. I highly recommend having a single Backup offering API and database tables.

This actually goes against what @weizhouapache advised in the previous point. If I merge the two APIs, the KNIB implementation will be substantially less isolated. One of the points I have raised from the beginning is keeping the implementation isolated, to avoid issues with the current implementation that ACS has, given that we want an alternative solution.

Furthermore, the importBackupOffering API needs an external ID from the backup provider. If there is no way to create the native offerings, where would we get the external ID from? Also, this API is meant to import backup offerings that were already defined elsewhere; defining the backup behavior with this offering would distort the API. I would have to butcher the whole API and make lots of exceptions for KNIB. I don't like this design. The design I'm proposing is much more elegant.

But NAS doesn't use native backup offerings, so this is still confusing.

I am now introducing the concept of native backup offerings to handle the different options we are providing in KNIB. NAS can absolutely implement compression, validation, file extraction, and so on, and start using the native backup offerings. That is up to the people who developed and are maintaining it. We decided to create something from scratch, as it felt better to work with our quality standard from the beginning.

@JoaoJandre @bernardodemarco , I have not looked at the code yet, but am trusting that it will be of the usual quality we are grown to expect from you guys.

Thank you :)

I do agree with @weizhouapache that “native” is not the best choice of names. Something indicating the “external” platform, which is internal in this case, would be better. So maybe “internal” or, as Wei suggests, “secondary storage”, “ss” or “secstor”. My main reason is that other backup solutions all have a native part as well.

Regarding the provider name, KVM Native Incremental Backup, I see no reason at all to change it; it is impossible that it will ever confuse users.

Regarding createNativeBackupOffering, listNativeBackupOfferings and deleteNativeBackupOffering (which are the only APIs with "native" in them, by the way): if it is truly necessary not to use the word "native" in an API, I can change these APIs to use the "internal" nomenclature. However, the name "internal" feels a bit weird in this context. I am not seeing why the use of the "native" wording here is so strongly opposed by you guys.

About the repository discussion: is it possible to name the particular secondary storages as repositories? The advantage would be that the backup framework remains of universal shape, and I don't think it has to impede any of the secstor backup functionality.

I don't understand this suggestion.

@DaanHoogland
Contributor

DaanHoogland commented Mar 17, 2026

Hello, @abh1sar, @DaanHoogland and @weizhouapache

I do agree with @weizhouapache that “native” is not the best choice of names. Something indicating the “external” platform, which is internal in this case, would be better. So maybe “internal” or, as Wei suggests, “secondary storage”, “ss” or “secstor”. My main reason is that other backup solutions all have a native part as well.

Regarding the provider name, KVM Native Incremental Backup, I see no reason at all to change it; it is impossible that it will ever confuse users.

@JoaoJandre , why is asking to change it not a good enough reason? I am tired after a day-long conflict resolution, so forgive me for not being patient with you. I find it strange to call that “no reason at all”. I read this as a sarcastic way of saying, “ok, let’s change it”. You are right that "the term "Native" does not go against any guideline in the official coding conventions". So does that mean that it is a good metaphor for the software? I really object to the name and would like it changed.

I do not agree with Wei that using SecStor for backups is not a good idea. It may not be in highly demanding production environments, but in smaller organisations it might well be a good solution. That does not mean that the name is not confusing. So, to put it more firmly: please change the name of the provider.

In addition to that, I see you are basically refusing all suggestions, which makes me suspicious. I didn’t involve myself in all of them but they can’t all be non-sensible. You clearly are invested in this PR, so I think it is in your interest to be more forthcoming to input.

I personally see no reason to continue with this PR without changing it.

If you are willing to make compromises, I will be happy to explain my point about backup repository usage, a point about framework adherence.

@JoaoJandre
Contributor Author

@JoaoJandre , why is asking to change it not a good enough reason? I am tired after a day-long conflict resolution, so forgive me for not being patient with you. I find it strange to call that “no reason at all”. I read this as a sarcastic way of saying, “ok, let’s change it”. You are right that "the term "Native" does not go against any guideline in the official coding conventions". So does that mean that it is a good metaphor for the software? I really object to the name and would like it changed.

I am not being sarcastic: asking for a change cannot be a reason for asking for a change. We have explained (see #12758 (comment)) why the name makes sense and is not confusing from our perspective. Yet I have not received a technical reason to change the name; I am only receiving "change it because I want it".

Anyways, I have an alternate name suggestion: KVM Incremental Backup Engine; is this better than KVM Native Incremental Backup? If so, why?

Regarding the API names, since "Native" is out of the question for API names (according to you), I will change the APIs to createInternalBackupOffering, listInternalBackupOffering, and deleteInternalBackupOffering.

I do not agree with Wei that using SecStor for backups is not a good idea. It may not be in high demanding production environments, but is smaller organisations it might well be a good solution.

I will repeat this as many times as necessary: secondary storage and the backup repositories of the NAS backup plugin are functionally the same; the same appliance that exports an NFS path for a backup repository can export an NFS path for secondary storage, and this secondary storage can be exclusive to backups if the operator of the cloud configures it to be. Sure, currently backup repositories support more protocols, but this is temporary; also, this limitation is not a problem for this PR.

Therefore, secondary storage can be as good as backup repositories for any type of production environment. Thus, I will not change the implementation to use backup repositories like the NAS backup plugin does. Moreover, if users prefer the other plugin, they are free to use it instead of the implementation that I am proposing; I guess it is fine to have different plugins with different capabilities.

In addition to that, I see you are basically refusing all suggestions, which makes me suspicious. I didn’t involve myself in all of them but they can’t all be non-sensible. You clearly are invested in this PR, so I think it is in your interest to be more forthcoming to input.

Anyone can make any suggestion, but this does not mean that it is a good suggestion. All suggestions have been answered with in-depth explanations that are the basis for refusing the suggestion. I am open to suggestions based on good technical reasons and I am willing to discuss any possible changes that may help refine the feature, but I will not accept a suggestion until it is proven to be better than the current implementation.

As a final note, this PR is proposing a new backup plugin. At the end of the day, if users do not want to use it (or if users do not trust it), they are free to do so and adopt any other alternative. This solution is isolated on its own, and therefore, there should not be much resistance to it getting merged. We will not affect any already existing workflow or feature with this PR.

Besides that, I am willing to change the naming if that is important to you. What do you think about calling the plugin KVM Incremental Backup Engine or KVM Incremental Backup Service, and renaming the APIs to createInternalBackupOffering, listInternalBackupOffering, and deleteInternalBackupOffering?

@DaanHoogland
Contributor

asking for a change cannot be a reason for asking for a change

Ok, another attempt, though I think I have explained it: native is not a good metaphor for the software. It is a secondary storage backup solution.

Note also that just saying it is a good name does not make it a good name.

I have other points of feedback, but if you are not willing to take this as an argument, there is no sense in talking about the details of the framework and how it fits or could fit better.

@JoaoJandre
Contributor Author

Hello, @DaanHoogland

What about the alternate names I proposed based on your feedback and suggestions?

@vishesh92
Member

I'd like to gently remind everyone — myself included — that we are all working toward the same goal here: making CloudStack better. This conversation has gotten a bit heated, and I think part of that is because a large PR with significant architectural decisions landed without earlier community discussion — something I've been guilty of myself in the past. For features of this scale, early alignment with the community helps avoid exactly this kind of friction. When that doesn't happen, it becomes even more important for the author to be open to the feedback that comes during review.

The Apache Ethos is a good reference for how we want to collaborate as a community. Disagreement on technical matters is healthy; let's just make sure we keep it respectful and constructive.


I have tried to go through the spec, the PR conversation, the existing backup framework docs, and the code at a high level. I might have missed some things given the size of this PR, so please correct me if I'm wrong on any point.

I can see that a lot of effort has gone into both the spec and the implementation. That said, I have some concerns that I think need to be addressed before this is production-safe.


Testing

The patch coverage is 5.59%. The classes that handle the most critical operations are almost completely untested.

Unit test coverage details
| Class | Coverage |
| --- | --- |
| NativeBackupServiceImpl.java | 0% |
| NativeBackupServiceJobController.java | 0% |
| BackupCompressionServiceJobController.java | 0% |
| BackupValidationServiceController.java | 0% |
| LibvirtTakeKnibBackupCommandWrapper.java | 0.49% |
| LibvirtValidateKnibVmCommandWrapper.java | 0.76% |
| KvmFileBasedStorageVmSnapshotStrategy.java | 0% |

These code paths run as root on hypervisor hosts and manipulate live VM disk chains. A bug here can cause data loss for tenants. This is not something we can rely on manual testing alone for — the next person who touches the VM snapshot strategy or volume migration code has no safety net to know they've broken something.

This is a 12k+ line PR with no Marvin tests. The manual test matrices in the PR description are good to have, but without automated integration tests, there is no way to catch regressions. Future changes to the VM snapshot code, volume lifecycle, or backup framework could silently break KNIB (or the other way around) with no automated detection.


Architecture

Using Secondary Storage for Backups

Three concerns with this approach:
  • If the secondary storage selector allows clean separation of backup storage from everything else, this can work. But this needs to be clearly documented — operators must understand that without proper selector configuration, backups will share space and capacity tracking with templates, ISOs, and snapshots.
  • The secondary storage selector JavaScript engine has had security issues — see CVE-2025-59302. While this is currently blocked by a global setting, extending the selector framework to also cover backups increases the surface area. This should at minimum be documented with a note about the possible impact.
  • Backups are generally expected to be stored in a separate, remote location from the primary infrastructure. If the secondary storage is co-located, the value of backups (surviving an infrastructure failure) is reduced. Was this tested with any meaningful network latency between the compute hosts / SSVM and a remote storage used for backups? What latencies were observed?

BackupError VM State

The BackupError state has no recovery path — no admin API, no background self-healing task, nothing. The spec says an operator must read the logs and recover the VM manually. In a real deployment, a management server restart or agent timeout during a backup window could put many VMs into this state at once. The tenants can't start, stop, migrate, or snapshot their VMs until an admin manually fixes each one. At the very least, this limitation and its operational impact should be clearly documented.

Changes to Existing Features

This PR modifies 224 files. Many of the changes are not isolated to the KNIB plugin — they touch shared infrastructure.

Affected areas
  • VM state machine (VirtualMachine.java, UserVmManagerImpl.java) — two new states added. Every component that checks VM state is affected.
  • VM snapshot strategy (KvmFileBasedStorageVmSnapshotStrategy.java, 82 new lines, 0% coverage) — the disk-only VM snapshot revert, create, and delete paths are modified to account for backup deltas. This is an existing feature that other users depend on. A bug here breaks VM snapshots for everyone, not just KNIB users.
  • Volume operations — volume attach, detach, destroy, and migrate paths are all modified to handle backup deltas.
  • Secondary storage selectors (HeuristicType.java) — extended for backups.

These changes are all reasonable for the feature, but combined with 0% test coverage on the affected paths, they carry a regression risk for users who don't use KNIB at all.

Naming

I don't have a strong opinion on the exact name, but I find "native" a bit unclear — native to what? KVM? CloudStack? The hypervisor? Something more descriptive of the actual mechanism would help operators understand what they're enabling. I'll leave this to the community to decide.

I'd also note that the coding conventions document was written by the community and will keep evolving. Citing it as a definitive authority on what is or isn't acceptable doesn't fully capture how we work — the conventions are a living document, and community consensus on what is good for the project's future carries equal weight.


Separate APIs vs Reusing Existing APIs

I would strongly prefer reusing the existing backup offering APIs rather than creating parallel ones. The existing importBackupOffering API could be extended, for example by making externalId optional, or by accepting a sentinel value such as internal or native. This keeps a single unified offering model in the framework, the UI, and the documentation.

Why this matters

To give an example: when I worked on the GPU feature, the approach was to extend the existing VM deployment APIs rather than create a separate createGPUEnabledVM command. The same principle should apply here.

Given that this project does not deprecate APIs, once the new createNativeBackupOffering / listNativeBackupOfferings / deleteNativeBackupOffering APIs ship, we are committed to maintaining them forever alongside the existing offering APIs. This is a decision that is very hard to walk back.

More broadly: if every new backup provider introduces its own offering API, we end up with an ever-growing API surface. If someone contributes another backup provider tomorrow, would we accept a third set of offering APIs? I don't think we would — and if we wouldn't accept it then, we should think carefully about accepting it now.


On Reusing and Improving Existing Code

One of the strengths of a community project is that we, as a community, collectively own and improve the codebase. None of us are working on code that "belongs" to us individually — we are all changing, refactoring, and improving code that someone else wrote years ago, sometimes by contributors who are no longer active in the project. That is how open source works and how the project stays healthy.

More on this

When existing code has gaps or quality concerns, the community expectation is that we improve it — not build a parallel implementation alongside it. If the existing NAS backup plugin has shortcomings (no incremental support, no compression, no validation), those are opportunities for contribution that benefit all users of the existing plugin.

Building KNIB as a completely separate provider with its own offering APIs, its own DB tables, and its own storage model means the NAS plugin's gaps remain unaddressed, and now we have two native backup implementations to maintain. This also sets a precedent: if the reasoning is "the existing code doesn't meet our standards, so we'll build our own," then any contributor could justify creating parallel implementations of any subsystem. Over time, this fragments the codebase rather than strengthening it.

The Apache Ethos emphasizes that community contributions should strengthen the commons, not create parallel silos. I'd encourage us all — including myself — to keep this principle in mind.


Security

Validation Command Injection

The backupValidationCommand VM setting is user-writable by default. From what I can see, the command and arguments are put into a JSON string using String.format():

```java
String GUEST_EXEC_COMMAND = "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";
```
Example of how this can be exploited

If the path value is not sanitized, a crafted command string can break out of the JSON structure. For example:

```java
String script = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
String command = String.format(GUEST_EXEC_COMMAND, script, "[]");
```

The embedded quote terminates the `path` value early and injects an additional `extra_param` key, so the resulting JSON has user-controlled structure that the QEMU guest agent could interpret differently than intended.
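A common fix is to escape every user-supplied value before splicing it into the payload, or to build the payload with a JSON library (e.g. Gson) instead of `String.format()`. Below is a minimal, stdlib-only sketch of the escaping approach; `GuestExecPayload`, `jsonEscape`, and `buildCommand` are hypothetical names, not code from this PR:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: build the guest-exec payload by escaping each value
// instead of splicing raw strings into a format template.
public class GuestExecPayload {

    // Minimal JSON string escaper: quotes, backslashes, control characters.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    static String buildCommand(String path, List<String> args) {
        String argArray = args.stream()
                .map(a -> "\"" + jsonEscape(a) + "\"")
                .collect(Collectors.joining(",", "[", "]"));
        return "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\""
                + jsonEscape(path) + "\",\"arg\":" + argArray
                + ",\"capture-output\":true}}";
    }
}
```

With this, the hostile `path` from the example above stays a single JSON string value: the embedded quote becomes `\"` and the newlines become `\n`, so no extra keys can be injected.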

Validation Resource Exhaustion

I don't see a hard cap on the total validation lifecycle time. A user could repeatedly trigger backups on stopped VMs with validation-enabled offerings, causing validation VMs to be created on compute hosts. Combined with enforce.resource.limit.on.backup.validation.vm defaulting to false and backup.validation.max.concurrent.operations defaulting to 0 (unlimited), this could lead to resource exhaustion on compute hosts, affecting other tenants.

Repeated backup commands on stopped VMs could potentially cause libvirt to become unresponsive on the host, which would then impact all VMs on that host.
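One possible mitigation, sketched here with hypothetical names (none of this is code from the PR), is a non-blocking admission gate so that validation work is rejected outright rather than queued without bound on a saturated host:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: cap the number of validation operations in flight.
// ValidationThrottle and maxConcurrent are illustrative names only.
public class ValidationThrottle {
    private final Semaphore slots;

    public ValidationThrottle(int maxConcurrent) {
        this.slots = new Semaphore(maxConcurrent);
    }

    // Returns false immediately instead of queuing unbounded work.
    public boolean runValidation(Runnable validation) {
        if (!slots.tryAcquire()) {
            return false; // too many validations already running; reject
        }
        try {
            validation.run(); // real code would also enforce a wall-clock timeout
            return true;
        } finally {
            slots.release();
        }
    }
}
```

With a cap like this in place, a setting value of 0 could plausibly mean "disabled" rather than "unlimited", which would close the exhaustion path by default.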


I can see that a lot of work has gone into this. My concerns are about making sure this is safe and maintainable for the long term. Happy to discuss any of these points further.

I'd also ask everyone — contributors and reviewers alike — to consider each other's feedback with an open mind. We as a community may not agree on everything, but we can disagree respectfully. This PR will be better for it, and so will the project.
