Kopiur
Kopiur (Kopia + Rust) is a Kopia-native Kubernetes backup operator written in
Rust on kube-rs. It makes a kopia repository
a first-class Kubernetes resource and separates the backup recipe from its
invocation from its schedule, so backups can be triggered by cron,
kubectl create, Argo Events, or a Helm hook — and a kopia snapshot's lifecycle
is tied to its Backup CR by a finalizer + deletionPolicy.
The whole CRD surface is modeled as Rust enums so invalid states are unrepresentable and reconcilers handle every variant at compile time. See ADR-0003 for the full design.
API group kopiur.home-operations.com, version v1alpha1. The CRD surface may
still change between releases.
The 7 CRDs (kopiur.home-operations.com/v1alpha1)
| CRD | Scope | Layer | Purpose |
|---|---|---|---|
| Repository | Namespaced | Storage | A kopia repository owned by one namespace: backend, encryption, credentials. |
| ClusterRepository | Cluster | Storage | A shared repository for platform teams, gated by allowedNamespaces. |
| BackupConfig | Namespaced | Recipe | What to back up: PVC sources, identity, retention, policy, hooks. Idempotent. |
| Backup | Namespaced | Invocation + Catalog | One kopia snapshot as a Kubernetes object. The universal trigger entry point. |
| BackupSchedule | Namespaced | Cron | When it runs: cron + jitter + timezone; creates Backup CRs. |
| Restore | Namespaced | Operation | Restore a snapshot to a PVC, or act as a passive volume-populator source. |
| Maintenance | Namespaced | Lifecycle | Schedules kopia maintenance quick + full with an ownership lease. |
Where to next
- Installation — prerequisites, install modes, and the CRD-lifecycle caveat.
- API reference (rustdoc) — the generated Rust API docs for every crate in the workspace.
- API conventions and Observability — developer notes.
- ADR-0003 — the canonical design document.
Installing Kopiur
Kopiur is a Kopia-native Kubernetes backup operator (Rust / kube-rs). This guide covers installing the operator with the bundled Helm chart and verifying it.
Status: alpha. API group kopiur.home-operations.com, version v1alpha1. The CRD surface may still change between releases.
Prerequisites
- Kubernetes >= 1.24. The deploy-or-restore volume-populator path (Restore + PVC.spec.dataSourceRef) relies on the AnyVolumeDataSource feature, available from 1.24 (ADR §4.7).
- Helm 3 or 4.
- A kopia repository backend you can reach: S3/MinIO, Azure Blob, GCS, B2, filesystem (PVC), SFTP, WebDAV, or rclone.
- (Optional) cert-manager: the simplest way to provision the admission webhook's serving certificate. Without it you provide the cert yourself.
- (Optional) volume-data-source-validator: recommended alongside CSI populators so a malformed dataSourceRef is surfaced as an event rather than a silently-stuck PVC (ADR §4.7).
- (Optional) Prometheus Operator: needed only if you want the chart's ServiceMonitor.
Quickstart
# 1. Create the operator namespace.
kubectl create namespace kopiur-system
# 2. Install the chart. Easiest path: let cert-manager mint the webhook cert.
helm install kopiur deploy/helm/kopiur \
--namespace kopiur-system \
--set webhook.certManager.enabled=true
# 3. Wait for rollout.
kubectl -n kopiur-system rollout status deploy/kopiur-controller
kubectl -n kopiur-system rollout status deploy/kopiur-webhook
# 4. Confirm the 7 CRDs are registered.
kubectl get crd -l app.kubernetes.io/part-of=kopiur
Without cert-manager
The webhook serves TLS and the API server must trust it. If you are not using cert-manager, create the serving Secret and pass the CA bundle:
# create a kubernetes.io/tls Secret named per webhook.tls.secretName, then:
helm install kopiur deploy/helm/kopiur \
--namespace kopiur-system \
--set webhook.certManager.enabled=false \
--set webhook.tls.secretName=kopiur-webhook-tls \
--set webhook.caBundle="$(base64 -w0 ca.crt)"
Or disable the webhook entirely (validation then relies on the controller's defensive checks only — not recommended):
helm install kopiur deploy/helm/kopiur -n kopiur-system --set webhook.enabled=false
Install scope
| Mode | --set installScope= | RBAC | Manages | ClusterRepository |
|---|---|---|---|---|
| Namespaced (default) | namespaced | Role | release namespace only | not reconciled |
| Cluster | cluster | ClusterRole | cluster-wide | reconciled |
Use cluster scope for a shared platform repository (ClusterRepository)
referenced by many tenant namespaces. See deploy/examples/02-cluster-repository.yaml.
CRD lifecycle
installCRDs: true (default) installs the 7 CRDs as Helm templates, so the
flag is honored and helm upgrade re-applies schema changes.
Caution: with templated CRDs, helm uninstall kopiur deletes the CRDs and every kopiur.home-operations.com object in the cluster (Repositories, Backups, ...). For an alpha API this is the intended, predictable behavior. To decouple CRD lifecycle from the release (e.g. GitOps), install with --set installCRDs=false and apply the generated CRDs out of band:
# Server-side apply is required: the BackupConfig CRD embeds a full JobSpec
# (runJob hook) and is too large for the client-side last-applied annotation.
kubectl apply --server-side -f deploy/crds/all-crds.yaml
The CRDs and RBAC shipped by the chart are generated by
cargo xtask gen-crds / cargo xtask gen-rbac and checked in under
deploy/crds/ and deploy/rbac/. Those xtasks are the source of truth.
First backup
After install, create a repository and start backing up a PVC. The smallest
end-to-end example is deploy/examples/01-single-pvc-scheduled.yaml:
kubectl apply -f deploy/examples/01-single-pvc-scheduled.yaml
kubectl get repositories,backupconfigs,backupschedules -n billing
Eight runnable walkthroughs live in deploy/examples/:
| File | Pattern |
|---|---|
| 01-single-pvc-scheduled.yaml | Single PVC, scheduled daily |
| 02-cluster-repository.yaml | Shared platform ClusterRepository (cluster scope) |
| 03-restore-by-backup.yaml | Restore by picking a Backup |
| 04-multi-pvc-selector.yaml | Multi-PVC label selector + group snapshot |
| 05-deploy-or-restore-gitops.yaml | Deploy-or-restore (PVC dataSourceRef) |
| 06-manual-backup.yaml | Manual one-shot Backup |
| 07-restore-discovered.yaml | Restore a discovered / foreign snapshot |
| 08-maintenance.yaml | kopia maintenance schedule + ownership lease |
Observability
- The controller serves /metrics, /healthz, and /readyz on its probe port (:8080); the webhook serves /metrics on its TLS port. All metrics are under the kopiur_ namespace.
- metrics.enabled=true (default) creates a metrics Service.
- metrics.serviceMonitor.enabled=true creates a Prometheus-Operator ServiceMonitor (requires the Prometheus-Operator CRDs); metrics.prometheusRule.enabled=true ships the kopiur alert rules.
- grafanaDashboard.enabled=true ships the Grafana dashboard as a sidecar-discoverable ConfigMap (source: deploy/dashboards/kopiur.json).
- observability.otlp.enabled=true (with observability.otlp.endpoint) additionally exports OTLP traces, logs, and a metrics push from the controller, webhook, and mover Jobs. Off by default.
Turn it all on with the ready-made overlay:
helm upgrade kopiur deploy/helm/kopiur -n kopiur-system \
-f deploy/observability-values.yaml
See docs/dev/observability.md for the full metric list,
OTLP details, and a sample collector config.
Upgrade / uninstall
helm upgrade kopiur deploy/helm/kopiur -n kopiur-system # re-applies CRD schema
helm uninstall kopiur -n kopiur-system # see CRD caution above
See also
- Design: docs/adr/0003-kopiur-rust-operator.md
- Chart values & modes: deploy/helm/kopiur/README.md
API reference (rustdoc)
The Rust API documentation for every crate in the Kopiur workspace —
kopiur-api, kopiur-kopia, kopiur-telemetry, kopiur-controller,
kopiur-webhook, kopiur-mover, and xtask — is generated with cargo doc and
published alongside this book.
kopiur-api is the best entry point: it holds the strongly-typed CRD definitions
and the shared validation/identity/retention logic, with no controller-runtime
dependencies.
Open the rustdoc API reference →
The API reference is built from the same commit as this book. It is nested under
/rustdoc/; the link above lands on a redirect into the kopiur_api crate.
kopiur-api conventions (READ BEFORE EDITING crates/api)
These conventions are load-bearing — they were derived empirically against
kube 3.1 + k8s-openapi 0.27 + schemars 1.2 on Rust 1.95. Violating them
breaks either CRD schema generation or compilation. ADR-0003 is the source of truth
for what the fields are; this file is how to encode them in Rust.
1. CRD top-level types
#[derive(CustomResource, Serialize, Deserialize, Clone, Debug, PartialEq, JsonSchema)]
#[kube(
    group = "kopiur.home-operations.com",
    version = "v1alpha1",
    kind = "BackupConfig",
    namespaced, // OMIT this line for ClusterRepository (cluster-scoped)
    status = "BackupConfigStatus",
    shortname = "kopiabc",
    category = "kopiur",
    printcolumn = r#"{"name":"Phase","type":"string","jsonPath":".status.phase"}"#
)]
#[serde(rename_all = "camelCase")]
pub struct BackupConfigSpec { /* ... */ }
- The kind derive generates the root struct named by kind (e.g. BackupConfig), with your *Spec as .spec and *Status as .status. Re-export both from lib.rs.
- Every spec/sub-object/status struct: #[serde(rename_all = "camelCase")].
2. Discriminated unions = externally-tagged Rust enums
Do NOT use #[serde(tag = "...")] (internally tagged). kube's structural-schema
rewriter hoists oneOf branch properties to the root and panics if a shared property
(the tag) differs across branches. Use serde's default external tagging:
#[derive(Serialize, Deserialize, Clone, Debug, PartialEq, Eq, JsonSchema)]
#[serde(rename_all = "camelCase")]
pub enum Backend {
    S3(S3Backend),
    Filesystem(FilesystemBackend),
    // ...
}
Wire shape: backend: { s3: {...} } (this matches ADR-0001 §3.1's YAML). The enum
still gives compile-time "exactly one variant" + exhaustive match — the ADR §5.5
thesis is fully preserved. Provide a kind_str(&self) -> &'static str helper for
status/metrics/printcolumns. Webhook validates per-variant content.
This applies to: Backend, AllowedNamespaces, RestoreSource, RestoreTarget,
Hook, and any other "exactly one of" surface.
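A minimal std-only sketch of the kind_str helper described above. The payload structs, their fields, and the returned strings ("s3", "filesystem") are assumptions for illustration; the real types in kopiur-api also carry the serde and schemars derives, which are omitted here so the snippet compiles on its own.

```rust
// Stand-ins for the real backend payloads; the fields are hypothetical.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct S3Backend {
    pub bucket: String,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct FilesystemBackend {
    pub path: String,
}

// Externally tagged on the wire in the real crate; plain enum here.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Backend {
    S3(S3Backend),
    Filesystem(FilesystemBackend),
}

impl Backend {
    /// Payload-free name for status fields, metric labels, and printcolumns.
    pub fn kind_str(&self) -> &'static str {
        // No wildcard arm: adding a variant fails the build until it is named here.
        match self {
            Backend::S3(_) => "s3",
            Backend::Filesystem(_) => "filesystem",
        }
    }
}
```

Because the match has no catch-all arm, a new backend variant cannot be added without also deciding its label.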
Simple closed string enums (no payload) are fine as plain unit enums and serialize as
strings: DeletionPolicy{Delete,Retain,Orphan}, Origin, *Phase, RepositoryKind,
ConcurrencyPolicy, etc. Give them #[derive(... Copy, Eq, Default ...)] and mark the
default variant #[default].
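As a rough illustration of the unit-enum pattern, the sketch below uses the DeletionPolicy variants named above. Which variant is marked #[default] is an assumption here (Delete), and the serde/JsonSchema derives of the real crate are left out so the snippet needs only std.

```rust
// Closed string enum with no payload: cheap to copy, totally ordered equality.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
pub enum DeletionPolicy {
    // Assumption: Delete is the default; check the ADR before relying on it.
    #[default]
    Delete,
    Retain,
    Orphan,
}

impl DeletionPolicy {
    /// The string form this variant would take on the wire.
    pub fn as_str(self) -> &'static str {
        match self {
            DeletionPolicy::Delete => "Delete",
            DeletionPolicy::Retain => "Retain",
            DeletionPolicy::Orphan => "Orphan",
        }
    }
}
```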
3. Eq and k8s-openapi types
k8s-openapi types (LabelSelector, ResourceRequirements, SecurityContext,
PodSpec, JobSpec, Condition, …) implement PartialEq but not Eq. Any struct
embedding one (directly or transitively) must derive PartialEq only — never Eq.
Reuse these types from k8s-openapi; do not re-invent them. The schemars feature is
enabled on k8s-openapi workspace-wide so they derive JsonSchema.
Use k8s_openapi::apimachinery::pkg::apis::meta::v1::{LabelSelector, Condition} and
k8s_openapi::api::core::v1::{ResourceRequirements, SecurityContext, ...}.
4. Optional blocks & forward-compat (ADR §4.11)
Every credential/policy/identity/schedule surface is a sub-object, not a leaf field.
Optionals: #[serde(default, skip_serializing_if = "Option::is_none")] pub x: Option<T>.
Bools that default false: #[serde(default, skip_serializing_if = "std::ops::Not::not")].
Vecs: #[serde(default, skip_serializing_if = "Vec::is_empty")].
5. Status
Always carries resolved.* pinned values (ADR §4.2: resolved identity pinned at
admission, never re-rendered). conditions: Vec<Condition> using the k8s-openapi type.
Phase is a closed enum with a #[default] of Pending.
6. Tests (in each CRD module or tests/)
Use the YAML→JSON→typed bridge (the API-server path), NOT serde_yaml directly
(serde_yaml 0.9 encodes externally-tagged enums as non-standard !Variant tags):
fn from_yaml<T: serde::de::DeserializeOwned>(yaml: &str) -> T {
    let v: serde_json::Value = serde_yaml::from_str(yaml).unwrap();
    serde_json::from_value(v).unwrap()
}
Per CRD, test: (a) T::crd() group/kind/scope/version; (b) round-trip the exact ADR
YAML and assert key fields + structural spec == reparse(serialize(spec)); (c) each
union variant (de)serializes under its expected key; (d) unknown variant is rejected.
Run: cargo test -p kopiur-api. Schema generation is exercised by any T::crd() call —
if an enum is mis-encoded, that call panics, so the crd() test catches it.
Observability
Kopiur exposes Prometheus metrics, and can additionally export OpenTelemetry
(OTLP) traces, logs, and metrics. The implementation lives in the
kopiur-telemetry crate, shared by the controller, webhook, and mover.
The one idea: instrument once, two readers
Metrics are instrumented once against the OpenTelemetry metrics API. A single
SdkMeterProvider fans out to two readers:
- an opentelemetry-prometheus exporter that populates a prometheus::Registry behind the always-on /metrics pull endpoint (so a ServiceMonitor scrapes the pods directly, with no collector required), and
- an OTLP PeriodicReader that pushes the same measurements to a collector, added only when OTEL_EXPORTER_OTLP_ENDPOINT is set.
Recording a value updates both; there is no double instrumentation. Traces (the
controller's #[instrument] reconcile spans) and logs (bridged from tracing
events) export over OTLP via tracing-opentelemetry and
opentelemetry-appender-tracing.
OTLP is env-gated and off by default. With no endpoint configured the
behavior is identical to fmt-only logging + the Prometheus pull, so the hermetic
test suite stays offline. A misconfiguration is logged with an actionable error
and degrades to fmt-logging + the Prometheus pull rather than crashing a
backup operator — unless KOPIUR_OTEL_STRICT=true, which makes it fail fast.
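The gating described above can be sketched as a pure function over the already-read environment values. This is not the kopiur-telemetry implementation: the TelemetryMode type and resolve_mode function are hypothetical names, and treating an empty endpoint as unset, and an unsupported protocol as the misconfiguration case, are assumptions.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum TelemetryMode {
    /// fmt logging + Prometheus pull only.
    PullOnly,
    /// fmt logging + Prometheus pull + OTLP push to `endpoint`.
    PullAndOtlp { endpoint: String },
}

/// `endpoint` and `protocol` mirror OTEL_EXPORTER_OTLP_ENDPOINT / _PROTOCOL;
/// `strict` mirrors KOPIUR_OTEL_STRICT=true.
pub fn resolve_mode(
    endpoint: Option<&str>,
    protocol: Option<&str>,
    strict: bool,
) -> Result<TelemetryMode, String> {
    // No endpoint (or an empty one): OTLP stays off, nothing to validate.
    let Some(endpoint) = endpoint.map(str::trim).filter(|e| !e.is_empty()) else {
        return Ok(TelemetryMode::PullOnly);
    };
    let protocol = protocol.unwrap_or("grpc");
    if protocol != "grpc" {
        let msg = format!("unsupported OTLP protocol {protocol:?}: only grpc is compiled in");
        if strict {
            // Strict mode: fail fast instead of degrading.
            return Err(msg);
        }
        // Default: log the problem and keep the always-on path running.
        eprintln!("telemetry misconfigured, continuing without OTLP: {msg}");
        return Ok(TelemetryMode::PullOnly);
    }
    Ok(TelemetryMode::PullAndOtlp { endpoint: endpoint.to_string() })
}
```

Keeping the decision separate from the exporter setup makes the degrade-versus-fail-fast rule testable without a collector.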
Logging (stdout / kubectl logs)
Every component writes structured tracing events to stdout via an fmt layer
installed by kopiur_telemetry::init_tracing. No collector is needed — this is the
always-on path that kubectl logs shows. Reconcilers carry a #[instrument] span
with kind, namespace, and name, so each line is attributable to the resource
being reconciled.
Level — the standard RUST_LOG filter (default info). Per-target directives
work: RUST_LOG=info,kopia=debug keeps the operator at info while surfacing
kopia's own progress and log output (emitted line-by-line under the kopia
target) in mover and controller logs. Without it, kopia's output is captured for the
failure tail but not printed.
Format — KOPIUR_LOG_FORMAT selects text (human-readable, default) or json
(one structured object per line for Loki/ELK/Datadog). An unrecognized value
degrades to text with a warning. In text mode ANSI color is suppressed when
stdout is not a TTY (i.e. in a container), so kubectl logs stays clean.
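A small sketch of the KOPIUR_LOG_FORMAT fallback rule, assuming exact lowercase matching and a hypothetical parse_log_format helper that returns the warning text instead of logging it. The real parsing lives in kopiur-telemetry and may differ in detail.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
pub enum LogFormat {
    #[default]
    Text,
    Json,
}

/// Returns the chosen format plus an optional warning for the caller to log.
pub fn parse_log_format(raw: Option<&str>) -> (LogFormat, Option<String>) {
    match raw.map(str::trim) {
        // Unset or empty: the documented default.
        None | Some("") | Some("text") => (LogFormat::Text, None),
        Some("json") => (LogFormat::Json, None),
        // Anything else degrades to text and reports why.
        Some(other) => (
            LogFormat::Text,
            Some(format!("unrecognized KOPIUR_LOG_FORMAT {other:?}, falling back to text")),
        ),
    }
}
```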
Movers inherit the controller's config. The controller forwards both RUST_LOG
and KOPIUR_LOG_FORMAT (alongside the OTLP vars) onto every mover Job, so a
backup/restore Job logs at the same level and format — set it once on the controller.
Helm knobs (logging.*, applied to controller + webhook, and through to movers):
| Key | Default | Effect |
|---|---|---|
| logging.level | "" → falls back to controller.logLevel | sets RUST_LOG (e.g. info,kopia=debug) |
| logging.format | text | sets KOPIUR_LOG_FORMAT (text/json) |
| controller.logLevel | info | deprecated alias for logging.level (kept for back-compat) |
# JSON logs everywhere, and show kopia's progress in mover logs:
helm upgrade --install kopiur deploy/helm/kopiur -n kopiur-system \
--set logging.format=json --set logging.level='info,kopia=debug'
HTTP endpoints
| Component | Endpoint | Notes |
|---|---|---|
| Controller | GET /metrics, /healthz, /readyz on :8080 (axum) | probes hit the real health routes |
| Webhook | GET /metrics on its TLS port (8443) | plus /healthz, /readyz |
| Mover | none (short-lived Job) | OTLP push only; flushed before exit |
Metrics
All metrics are under the kopiur_ namespace. The Prometheus exporter applies
the OTel→Prometheus conventions, so a counter instrument named
kopiur_x is exported as kopiur_x_total.
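To make the naming rule concrete, here is a tiny sketch of how a counter's exported name relates to its instrument name. The function is hypothetical and is not the exporter's code; leaving a name that already ends in _total unchanged is an assumption.

```rust
/// Maps a counter instrument name to the name seen on /metrics.
pub fn exported_counter_name(instrument: &str) -> String {
    if instrument.ends_with("_total") {
        // Assumption: an existing suffix is not doubled.
        instrument.to_string()
    } else {
        format!("{instrument}_total")
    }
}
```

When writing alert rules or dashboard queries, use the exported name with the suffix, not the instrument name from the code.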
| Metric | Type | Labels | Source |
|---|---|---|---|
| kopiur_controller_reconciliations_total | counter | kind | every reconcile |
| kopiur_controller_reconcile_errors_total | counter | kind, class (transient/structural) | error_policy |
| kopiur_controller_reconcile_duration_seconds | histogram | kind | every reconcile |
| kopiur_resource_phase | gauge (0/1) | kind, namespace, name, phase | CR status; 1 = active phase, 0 = others; zeroed on deletion |
| kopiur_backup_last_success_timestamp_seconds | gauge | namespace, name | Backup → Succeeded |
| kopiur_backup_consecutive_failures | gauge | namespace, name | BackupConfig reconcile (trailing Failed before the latest Succeeded) |
| kopiur_backup_size_bytes | gauge | namespace, name | Backup status.stats.sizeBytes |
| kopiur_backup_files | gauge | namespace, name | Backup file counts (absent when unknown) |
| kopiur_backup_duration_seconds | gauge | namespace, name | Backup status.timing.durationSeconds |
| kopiur_orphaned_snapshots_total | counter | namespace | Orphan policy / skip-cleanup escape hatch |
| kopiur_snapshot_deletion_failures_total | counter | namespace | finalizer snapshot-delete failures |
| kopiur_schedule_backups_created_total | counter | namespace, name | BackupSchedule fires |
| kopiur_repo_size_bytes | gauge | namespace, name | logical bytes under management (newest snapshot per source) |
| kopiur_repo_snapshot_count | gauge | namespace, name | repository catalog scan |
| kopiur_repo_discovered_backups | gauge | namespace, name | repository catalog scan |
| kopiur_repository_maintenance_configured | gauge (0/1) | kind, namespace, name | Repository/ClusterRepository reconcile once Ready; 1 = a Maintenance references it, 0 = none (also emits a MaintenanceNotConfigured Warning event + MaintenanceConfigured condition) |
| kopiur_restore_duration_seconds | gauge | namespace, name | restore Job completion − start |
| kopiur_maintenance_last_reclaimed_bytes | gauge | namespace, name | full maintenance run |
| kopiur_webhook_admission_total | counter | kind, decision (allowed/denied) | admission webhook |
| kopiur_mover_operations_total | counter | operation, result | mover Job (OTLP push) |
| kopiur_mover_operation_duration_seconds | histogram | operation, result | mover Job (OTLP push) |
Notes:
- kopiur_resource_phase is zeroed when a CR is deleted so … == 1 alerts clear before the object is garbage-collected (OTel sync gauges can't drop a series; zeroing is the available remedy). Series for long-deleted resources persist at 0.
- Per-resource gauges are re-read from the freshest status on each successful reconcile, so they don't lag a cycle behind a phase transition.
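The 0/1 phase gauge and its zero-on-delete behavior can be pictured with the sketch below. The phase list and the phase_series helper are assumptions for illustration (only Pending, Succeeded, and Failed are named in this book); the real controller records these values through the OpenTelemetry metrics API.

```rust
use std::collections::BTreeMap;

// Hypothetical closed set of phases for one CR kind.
pub const PHASES: [&str; 4] = ["Pending", "Running", "Succeeded", "Failed"];

/// Values of the `phase`-labelled series for one resource.
/// `Some(phase)` is a live resource; `None` is a deleted one.
pub fn phase_series(active: Option<&str>) -> BTreeMap<&'static str, u8> {
    PHASES
        .iter()
        // 1 for the active phase, 0 for every other phase.
        .map(|&phase| (phase, u8::from(active == Some(phase))))
        .collect()
}
```

A deleted resource yields all zeros rather than no series, which is why a `== 1` alert clears while the series itself remains.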
Enabling everything (Helm)
helm upgrade --install kopiur deploy/helm/kopiur -n kopiur-system \
--set metrics.serviceMonitor.enabled=true \
--set metrics.prometheusRule.enabled=true \
--set grafanaDashboard.enabled=true \
--set webhook.serviceMonitor.enabled=true \
--set observability.otlp.enabled=true \
--set observability.otlp.endpoint=http://otel-collector.observability.svc:4317
A ready-to-use values overlay is at
deploy/observability-values.yaml:
helm upgrade --install kopiur deploy/helm/kopiur -n kopiur-system \
-f deploy/observability-values.yaml
Keys (see deploy/helm/kopiur/values.yaml for the full set):
| Key | Default | Effect |
|---|---|---|
| metrics.serviceMonitor.enabled | false | scrape the controller /metrics |
| metrics.prometheusRule.enabled | false | install the kopiur alert rules |
| grafanaDashboard.enabled | false | ship the dashboard as a sidecar ConfigMap |
| webhook.serviceMonitor.enabled | false | scrape the webhook /metrics (HTTPS) |
| observability.otlp.enabled | false | export OTLP from all components |
| observability.otlp.endpoint | …:4317 | collector gRPC endpoint (required when enabled) |
| observability.otlp.protocol | grpc | only gRPC is compiled in |
| observability.otlp.headers | "" | e.g. authorization=Bearer … |
| observability.otlp.strict | false | fail-fast on telemetry misconfig |
When OTLP is enabled the controller passes the same OTEL_EXPORTER_OTLP_* env to
every mover Job it creates, so mover traces/logs/metrics reach the same collector.
Environment variables
The env var names are centralized in crates/telemetry/src/env.rs
(OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL,
OTEL_EXPORTER_OTLP_HEADERS, KOPIUR_OTEL_STRICT, plus the logging vars
RUST_LOG and KOPIUR_LOG_FORMAT); the Helm observability.otlp and logging
blocks set them. Only gRPC is compiled in — point the endpoint at the collector's
gRPC port (4317). Setting OTEL_EXPORTER_OTLP_PROTOCOL to anything other than
grpc is rejected with an actionable error.
OTLP_PASSTHROUGH and LOG_PASSTHROUGH (same module) list the vars the controller
forwards onto mover Jobs: OTLP only when a collector is configured, logging
whenever set.
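The forwarding rule can be sketched as follows. The constant names match the ones mentioned above, but their exact contents here are an assumption, and mover_env is a hypothetical helper, not the controller's actual Job builder.

```rust
use std::collections::BTreeMap;

// Assumed contents; the authoritative lists live in crates/telemetry/src/env.rs.
pub const OTLP_PASSTHROUGH: [&str; 4] = [
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "OTEL_EXPORTER_OTLP_PROTOCOL",
    "OTEL_EXPORTER_OTLP_HEADERS",
    "KOPIUR_OTEL_STRICT",
];
pub const LOG_PASSTHROUGH: [&str; 2] = ["RUST_LOG", "KOPIUR_LOG_FORMAT"];

/// Picks the controller env vars to copy onto a mover Job.
pub fn mover_env(controller_env: &BTreeMap<String, String>) -> Vec<(String, String)> {
    // A collector counts as configured only when the endpoint is non-empty.
    let otlp_on = controller_env
        .get("OTEL_EXPORTER_OTLP_ENDPOINT")
        .map_or(false, |v| !v.is_empty());
    let mut out = Vec::new();
    // Logging vars are forwarded whenever they are set.
    for name in LOG_PASSTHROUGH {
        if let Some(value) = controller_env.get(name) {
            out.push((name.to_string(), value.clone()));
        }
    }
    // OTLP vars are forwarded only when a collector is configured.
    if otlp_on {
        for name in OTLP_PASSTHROUGH {
            if let Some(value) = controller_env.get(name) {
                out.push((name.to_string(), value.clone()));
            }
        }
    }
    out
}
```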
Dashboard
deploy/dashboards/kopiur.json is the source of truth (import it into Grafana
directly). The chart copy under deploy/helm/kopiur/files/dashboards/kopiur.json
is generated from it by cargo xtask gen-all and guarded by
cargo xtask gen-all --check, so the two can never drift. Edit the source, then
regenerate.
Grafana via the OTLP path
If you run OTLP-only and don't scrape the pods, point Prometheus at the collector instead. A minimal OpenTelemetry Collector that ingests OTLP and re-exposes a Prometheus scrape target:
# otel-collector config (configmap data)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889 # scrape this with Prometheus
  debug: {} # prints traces/logs to the collector log; the pipelines below reference it
service:
  pipelines:
    metrics: { receivers: [otlp], exporters: [prometheus] }
    traces: { receivers: [otlp], exporters: [debug] }
    logs: { receivers: [otlp], exporters: [debug] }
For most users the direct-scrape ServiceMonitor path is simpler; OTLP is for
shops that already run a collector and want traces + logs alongside metrics.
ADR-0003: Kopiur — A Kopia-Native Backup Operator in Rust
- Status: Proposed
- Date: 2026-06-01
- Supersedes: ADR-0001 (onedr0p draft), ADR-0002 (bo0tzz draft)
- Inspired by: backube/volsync, the kopia fork perfectra1n/volsync (especially PR backube/volsync#1723 and the trigger-redesign proposal backube/volsync#1559), CloudNativePG (Cluster/ScheduledBackup/Backup), and Tekton (Task/TaskRun).
- Implementation language: Rust, built on kube-rs.
Scope: this ADR covers CRD shape, user experience, high-level design choices, and the Rust/kube-rs implementation surface. It deliberately defers specific controller-runtime lease IDs, the cron-library choice (tokio-cron-scheduler vs croner vs custom), and per-finalizer reconcile-loop layout to follow-up ADRs once the API is agreed.
This document is the canonical ADR for the kopiur project. The two predecessor drafts (docs/adr/0001-onedr0p-kopia-operator.md, docs/adr/0002-bo0tzz-kopia-operator.md) are preserved as historical input; they assumed Go and disagreed on several points. This ADR resolves those disagreements explicitly:
| Topic | onedr0p draft | bo0tzz draft | Kopiur (this ADR) |
|---|---|---|---|
| CRD count | 7 (Repository, ClusterRepository, BackupConfig, Backup, BackupSchedule, Restore, Maintenance) | 5 (no ClusterRepository, Maintenance merged loosely) | 7 — keep ClusterRepository (§3.2) |
| Successful retention | GFS only (BackupConfig.spec.retention) | GFS and successfulJobsHistoryLimit | GFS only (§4.4); failures bounded separately |
| Snapshot deletion when CR deleted | deletionPolicy (Delete default for produced, Retain forced for discovered) | Not addressed | Adopt onedr0p model (§4.5) |
| Implementation language | Go (controller-runtime) | Go (controller-runtime) | Rust (kube-rs + tokio) |
| Mover image | Go binary + kopia | Go binary + kopia | Rust binary + kopia (§4.10) |
1. Context
VolSync is the de-facto Kubernetes-native PVC mover. Its design is mature and battle-tested, but it has accreted around restic's model. As soon as you try to add a non-restic mover (kopia, rustic, borg, …) several deep design choices push back. The community fork perfectra1n/volsync proves out a kopia mover and ships a usable image — but its PR has been open ~13 months without merging, upstream maintainers are capacity-constrained, and many users have switched to running the fork in production.
The fork's existence and the volume of feature requests around kopia/restic locking, multi-PVC backup, scheduling jitter, restore UX, trigger separation, snapshot lifecycle, and "stop running on apply" suggest something stronger than "land kopia in volsync" is warranted. A kopia-native operator can:
- Drop the multi-mover abstraction entirely. Kopia is the only mover, so every CRD field can be expressive without leaking through a generic shape.
- Make a repository a first-class Kubernetes resource — at both namespace and cluster scope. Kopia repos are designed to be shared across many writers, including across namespaces.
- Separate recipe, invocation, and schedule so backups can be triggered by any source (cron, kubectl create, Argo Events, button-in-Grafana). Volsync's trigger field couples all three.
- Use kopia's native identity model (username@hostname:path) deliberately rather than as an accident of metadata.name / metadata.namespace.
- Treat kopia maintenance and snapshot lifecycle as first-class operator concerns rather than retrofits.
- Tie the lifecycle of a Kopia snapshot to the lifecycle of its Backup CR by default, with explicit opt-outs. This addresses the persistent volsync confusion that deleting a ReplicationSource has no effect on snapshots in the repository.
- Surface kopia's snapshot catalog through CRDs so restore is "browse and reference," not "construct a restoreAsOf timestamp and hope."
- Address the long backlog of papercuts as design decisions, not bug fixes.
The API group is kopiur.home-operations.com with initial version v1alpha1. The project name is kopiur (Kopia + Rust); the binary, container, and helm chart all use that name.
Group rename (post-decision): earlier drafts of this ADR used the group kopia.io. That is the upstream Kopia project's own domain, and using it for our CRDs would wrongly imply their ownership/endorsement, so the group, and every kopia.io/-prefixed finalizer, label, and annotation, were moved to kopiur.home-operations.com. References to the real Kopia project's documentation (e.g. https://kopia.io/docs/) are unchanged. The predecessor drafts ADR-0001/0002 are left as historical record and still show the original kopia.io.
1.1 The most important gaps we are addressing
| # | Gap | volsync refs |
|---|---|---|
| G1 | Repository is not a Kubernetes resource; cannot be shared/reused cleanly | implicit; perfectra1n CRD shape |
| G2 | One ReplicationSource = one PVC | #1115, #1116, #320 |
| G3 | First reconcile triggers an immediate backup, no GitOps-friendly "skip first run" | #627 |
| G4 | No cron jitter / H substitution, no timezone | #1421, #702 |
| G5 | Restic repo locking / piling-up jobs | #1042, #1429, #646 |
| G6 | No retry-limit / backoffLimit override | #1228, #1042 |
| G7 | Restore proceeds with empty PVC if no snapshot found | #1211 |
| G8 | Snapshot selection is restic-format restoreAsOf only; no browse | #7, #1211 |
| G9 | latestImage always wins — no immutable restore source | disc #1115 |
| G10 | Volume populator + Direct copyMethod incompatibility | disc #1115, #1129 |
| G11 | Maintenance ownership is implicit & runs in the same pod as backup | perfectra1n fork redesigned this three times |
| G12 | Policy passthrough is brittle: every kopia knob needs CRD/jq script changes | fork #13, #23 |
| G13 | Snapshot actions run in mover, not workload | fork #22 |
| G14 | OOMs unpredictable; no resource guidance | #626, #707, #1228 |
| G15 | Mover image is :latest by default | volsync restic/builder.go:42 |
| G16 | Restricted PSA / OpenShift SCC / unprivileged-mode lost+found papercuts | #367, #1033, #1889, #1430 |
| G17 | Trigger semantics are baked into the source CR — no manual/external trigger path | #1559 |
| G18 | Mover-pod lifecycle (zombie pods, stuck jobs) | fork #8, volsync #1415 |
| G19 | Maintainers' explicit door-closing on new movers | #1743, #1029, #320 |
| G20 | Deleting the source CR doesn't delete snapshots from the repository | implicit |
| G21 | No Rust-native, type-safe controller surface for the Kubernetes ecosystem's backup tier | (new) |
G21 is the new entry: it's not a volsync defect, it's a positive reason to choose Rust. Memory safety, exhaustive enum matching at the type level, and kube-rs's strongly-typed CRD derive macro produce a controller surface where invalid states are unrepresentable at compile time — exactly the property a stateful data-protection controller wants.
2. Decision
2.1 Topology
Seven CRDs in kopiur.home-operations.com/v1alpha1. Six are namespaced; ClusterRepository is cluster-scoped.
| CRD | Scope | Layer | Purpose |
|---|---|---|---|
| Repository | Namespaced | Storage | A kopia repository owned by one namespace: credentials, backend, encryption, optional catalog-materialization bounds. Many BackupConfigs/Restores reference one. |
| ClusterRepository | Cluster | Storage | A shared kopia repository operated by the platform team, referenceable from allow-listed namespaces. Identity defaults are templated per consumer namespace. |
| BackupConfig | Namespaced | Recipe | What to back up: PVC selector, identity, retention, policy, hooks. Idempotent: doesn't run anything on its own. |
| Backup | Namespaced | Invocation + Catalog | A single kopia snapshot as a Kubernetes object. Created by BackupSchedule, kubectl create, or any other trigger source. Also materialized by the operator from the kopia catalog for snapshots it didn't produce (foreign or pre-install). |
| BackupSchedule | Namespaced | Cron | When it runs: cron (with jitter + timezone) + configRef. Creates Backup CRs. |
| Restore | Namespaced | Operation | A restore from a snapshot/identity to a PVC. Used directly, or referenced by PVC.spec.dataSourceRef. |
| Maintenance | Namespaced | Lifecycle | One per Repository/ClusterRepository: schedules kopia maintenance run quick + full, manages ownership lease. |
Backup is also the single canonical representation of a kopia snapshot — both ones we produced and ones we discover in the repo. Three retention drivers cover the lifecycle:
- BackupConfig.spec.retention (GFS: keepLatest / keepHourly / keepDaily / ...) is the primary mechanism. The operator periodically computes the retention set for each (BackupConfig identity, source) tuple and deletes Backup CRs outside it. Each deleted CR's deletionPolicy determines whether the underlying kopia snapshot goes with it. Details in §4.4.
- BackupSchedule.spec.failedJobsHistoryLimit bounds failed Backup CRs from a schedule (GFS doesn't apply to failures).
- Repository.spec.catalog.retain / ClusterRepository.spec.catalog.retain bounds the origin: discovered Backup CR set, keeping etcd footprint sane for large repos. Discovered Backups always have deletionPolicy: Retain so this never deletes real snapshots (§4.5).
Manual Backup CRs (origin: manual with no schedule parent) are user-owned and not auto-GC'd; their snapshots are tied to their deletionPolicy.
This is the resolution of the onedr0p-vs-bo0tzz retention disagreement: GFS is the only successful-retention driver, and successfulJobsHistoryLimit does not exist on BackupSchedule. Failures use a flat count because GFS over failures is not meaningful.
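The interaction between origin and deletionPolicy described in this section can be sketched as two small functions. The names effective_policy and deletes_snapshot are hypothetical, the Delete default for produced Backups follows the comparison table earlier in this ADR, and treating both Retain and Orphan as "leave the snapshot in the repository" is an assumption pending §4.5.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Origin {
    Scheduled,
    Manual,
    Discovered,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DeletionPolicy {
    Delete,
    Retain,
    Orphan,
}

/// Policy actually applied to a Backup CR, given what the user asked for.
pub fn effective_policy(origin: Origin, requested: Option<DeletionPolicy>) -> DeletionPolicy {
    match origin {
        // Discovered backups are forced to Retain: never delete data we didn't create.
        Origin::Discovered => DeletionPolicy::Retain,
        // Produced backups default to Delete unless the user opts out.
        Origin::Scheduled | Origin::Manual => requested.unwrap_or(DeletionPolicy::Delete),
    }
}

/// Whether removing the CR should also remove the kopia snapshot.
pub fn deletes_snapshot(policy: DeletionPolicy) -> bool {
    match policy {
        DeletionPolicy::Delete => true,
        DeletionPolicy::Retain | DeletionPolicy::Orphan => false,
    }
}
```

With this split, GFS pruning and catalog bounds only ever delete CRs; whether a snapshot follows is decided by the policy alone.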
2.2 Anchoring principles
- Repositories are objects, at both namespace and cluster scope. Identity, lifecycle, maintenance, and tenancy gating hang off them.
- Triggers are separable from recipes. A Backup CR can be created by a schedule, kubectl create, Argo Events, or a Helm hook. The recipe (BackupConfig) never knows or cares.
- GitOps "deploy-or-restore" is a first-class pattern. New cluster + existing repo + apply manifests → optionally restores latest snapshot before app starts.
- A Backup CR owns the lifecycle of its kopia snapshot by default. Deleting the CR deletes the snapshot from the repository, governed by a deletionPolicy field. Discovered backups are forced to Retain so the operator never deletes data it didn't create.
- Restores are explicit. No silent "empty PVC because no snapshots existed yet" by default. The "deploy-or-restore" GitOps pattern is opt-in via a specific source mode + onMissingSnapshot: Continue.
- Maintenance is a first-class lifecycle concern, with its own CRD and explicit ownership lease.
- The mover is a thin Rust shim. A statically-linked Rust binary invokes kopia --json and parses results. No 2,500-line bash scripts. The image carries kopia and the kopiur mover binary. See §4.10.
- Validation is webhook-enforced. Mutually exclusive fields, missing repository references, malformed schedules, and cross-tenant references are rejected at admission. Webhook is implemented with kube-rs's axum-based handler.
- Identity is explicit and overridable. Defaults derive from object name/namespace; every component is overridable; the resolved identity always appears in status.
- Forward-compatible by construction. Every credential, policy, and rotation surface is a sub-object, so future fields slot in without API breakage (see §4.13).
- Type-safety end-to-end. Rust's enums + serde discriminators express CRD oneOf constraints at compile time inside the controller. Any state the type system permits is a state the reconciler handles.
2.3 Where Backup CRs live
Same as the onedr0p draft:
| Origin | Namespace |
|---|---|
| scheduled / manual (produced by us) | The BackupConfig's namespace |
| discovered (materialized from the kopia catalog) | The Repository's namespace or, for snapshots discovered under a ClusterRepository, the namespace named in the snapshot's identity, if it exists and is in the allowedNamespaces set. Falls back to a configurable catalog.fallbackNamespace otherwise. |
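The placement rule for discovered Backups in the table above reduces to a small decision function. This is a sketch under stated assumptions: RepoScope and discovered_namespace are hypothetical names, and the set of existing namespaces is passed in rather than read from the API server.

```rust
use std::collections::BTreeSet;

/// Where a discovered snapshot was found. Names are illustrative.
pub enum RepoScope<'a> {
    /// Namespaced Repository: discovered Backups live beside it.
    Namespaced { repo_namespace: &'a str },
    /// ClusterRepository: placement depends on the snapshot identity.
    Cluster {
        allowed: &'a BTreeSet<String>,
        fallback_namespace: &'a str,
    },
}

/// Picks the namespace for an `origin: discovered` Backup CR.
/// `identity_namespace` is the namespace named in the snapshot's identity;
/// `existing` is the set of namespaces currently present in the cluster.
pub fn discovered_namespace<'a>(
    scope: &RepoScope<'a>,
    identity_namespace: &'a str,
    existing: &BTreeSet<String>,
) -> &'a str {
    match scope {
        RepoScope::Namespaced { repo_namespace } => *repo_namespace,
        RepoScope::Cluster { allowed, fallback_namespace } => {
            // Both conditions from the table must hold; otherwise fall back.
            if existing.contains(identity_namespace) && allowed.contains(identity_namespace) {
                identity_namespace
            } else {
                *fallback_namespace
            }
        }
    }
}
```

Keeping the rule free of I/O makes both fallback paths (namespace missing, namespace not allow-listed) easy to cover in unit tests.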
3. CRD Design
The full per-CRD field surface is identical to ADR-0001 (onedr0p) §3.1–§3.7. To keep this file readable, only the sections that differ from ADR-0001, or that need Rust-specific guidance, are reproduced here. Cross-references to ADR-0001 sections are by section number.
3.1 Repository
See ADR-0001 §3.1. No semantic changes.
Rust shape (sketch):
use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(CustomResource, Serialize, Deserialize, Clone, Debug, JsonSchema)]
#[kube(
    group = "kopiur.home-operations.com",
    version = "v1alpha1",
    kind = "Repository",
    namespaced,
    status = "RepositoryStatus",
    shortname = "kopiarepo",
    printcolumn = r#"{"name":"Phase","type":"string","jsonPath":".status.phase"}"#,
    printcolumn = r#"{"name":"Backend","type":"string","jsonPath":".spec.backend.kind"}"#
)]
#[serde(rename_all = "camelCase")]
pub struct RepositorySpec {
    pub backend: Backend,
    pub encryption: Encryption,
    #[serde(default)]
    pub create: Option<CreateBehavior>,
    #[serde(default)]
    pub cache_defaults: Option<CacheDefaults>,
    #[serde(default)]
    pub catalog: Option<CatalogBounds>,
}

#[derive(Serialize, Deserialize, Clone, Debug, JsonSchema)]
#[serde(tag = "kind", rename_all = "PascalCase")]
pub enum Backend {
    S3(S3Backend),
    Azure(AzureBackend),
    Gcs(GcsBackend),
    B2(B2Backend),
    Filesystem(FilesystemBackend),
    Sftp(SftpBackend),
    WebDav(WebDavBackend),
    Rclone(RcloneBackend),
}
The #[serde(tag = "kind")] enum is what enforces the oneOf shape that ADR-0001 expressed via a JSON-schema rule and webhook check. In Rust it's a compile-time invariant: a deserialized Backend value is always exactly one variant. The webhook still validates content (bucket name format, credential secret reachability) but cannot receive a multi-variant value.
3.2 ClusterRepository
See ADR-0001 §3.2. Same shape; cluster-scoped via #[kube(... )] without the namespaced flag, plus the allowedNamespaces tenancy gate. The validating-admission webhook is the enforcement point for cross-namespace references — the controller never trusts that the API server pre-filtered them.
#[derive(CustomResource, Serialize, Deserialize, Clone, Debug, JsonSchema)]
#[kube(
    group = "kopiur.home-operations.com",
    version = "v1alpha1",
    kind = "ClusterRepository",
    status = "ClusterRepositoryStatus",
    shortname = "kopiacrepo"
)]
// camelCase field names (allowedNamespaces, identityDefaults), as on RepositorySpec.
#[serde(rename_all = "camelCase")]
pub struct ClusterRepositorySpec {
    // Same as RepositorySpec, plus:
    pub allowed_namespaces: AllowedNamespaces,
    #[serde(default)]
    pub identity_defaults: Option<IdentityTemplate>,
}

#[derive(Serialize, Deserialize, Clone, Debug, JsonSchema)]
#[serde(rename_all = "camelCase")]
pub enum AllowedNamespaces {
    List(Vec<String>),
    Selector(LabelSelector),
    All(bool),
}
3.3 – 3.7 BackupConfig, Backup, BackupSchedule, Restore, Maintenance
Field surface is identical to ADR-0001 §3.3–§3.7. The only Rust-specific note is that all CRD spec/status structs use #[derive(JsonSchema)] so the generated OpenAPI schema goes into the CRD manifest at build time (via kopium-style codegen for tests; via kube::CustomResourceExt::api_resource() at runtime).
The discriminated unions (source.backupRef | fromConfig | identity on Restore, repository.kind = Repository | ClusterRepository on consumers, target.pvc | pvcRef on Restore, Backend on Repository) are all #[serde(tag = "kind")] or untagged-with-fallback enums in Rust. The webhook validates inter-field constraints that can't be expressed in the type system (e.g. "if kind: ClusterRepository, then namespace field is forbidden, and the consumer's namespace must be in allowedNamespaces").
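The allowedNamespaces check that the webhook performs could look roughly like the sketch below. It is a std-only mirror of the AllowedNamespaces enum from §3.2, under two assumptions: the selector is reduced to matchLabels (the real type wraps a Kubernetes LabelSelector), and the function name namespace_allowed is hypothetical.

```rust
use std::collections::BTreeMap;

/// Std-only mirror of the AllowedNamespaces enum. Only `matchLabels`
/// is modeled for the selector variant.
pub enum AllowedNamespaces {
    List(Vec<String>),
    Selector(BTreeMap<String, String>),
    All(bool),
}

/// Returns true when a consumer in `namespace` (carrying `labels`) may
/// reference the ClusterRepository.
pub fn namespace_allowed(
    gate: &AllowedNamespaces,
    namespace: &str,
    labels: &BTreeMap<String, String>,
) -> bool {
    match gate {
        AllowedNamespaces::List(names) => names.iter().any(|n| n == namespace),
        // Every selector label must be present with the same value.
        // An empty selector therefore matches every namespace.
        AllowedNamespaces::Selector(required) => required
            .iter()
            .all(|(k, v)| labels.get(k) == Some(v)),
        AllowedNamespaces::All(enabled) => *enabled,
    }
}
```

The match has no wildcard arm, so a fourth gating mode added later would have to be handled here before the crate compiles again.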
4. Key behaviors
4.1 Scheduling
See ADR-0001 §4.1. The Rust cron implementation currently assumes croner (POSIX cron plus extensions), driven by one tokio interval task per BackupSchedule; the final library choice is still deferred (§8). H jitter substitution is computed deterministically from (scheduleUID, slot_start), so retries hit the same wall-clock slot.
The schedule is anchored to wall-clock time (cron(now)), not to cron(lastSyncTime), which fixes volsync's drift. The pinned scheduledAt lets ops alerts say "you missed the 02:13 slot" without ambiguity.
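One way to get a jitter offset that is stable for a given (scheduleUID, slot_start) pair is to hash the pair with a fixed algorithm and reduce it into the jitter window. The sketch below is an assumption about how that could be done, not the project's implementation: the choice of FNV-1a, the window parameter, and both function names are illustrative.

```rust
/// 64-bit FNV-1a. Chosen here because its output is fixed by the algorithm,
/// unlike `std::collections::hash_map::DefaultHasher`, whose algorithm is
/// not guaranteed to stay the same across Rust releases.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Offset in seconds, in `0..window_secs`, for one schedule slot.
/// The same (schedule_uid, slot_start) pair always yields the same offset,
/// so a retried reconcile lands on the same wall-clock time.
pub fn jitter_offset_secs(schedule_uid: &str, slot_start: u64, window_secs: u64) -> u64 {
    if window_secs == 0 {
        return 0; // jitter disabled
    }
    let mut input = schedule_uid.as_bytes().to_vec();
    input.extend_from_slice(&slot_start.to_be_bytes());
    fnv1a(&input) % window_secs
}
```

Because the slot start is part of the hash input, consecutive slots of one schedule generally get different offsets, while different schedules spread out across the window.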
4.2 Identity model
See ADR-0001 §4.2. Identity templates are rendered with tera (Jinja2-compatible) at admission; the resolved identity is pinned to status.resolved.identity and never re-rendered after admission.
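The resolve-once behaviour can be sketched without a template engine. The example below assumes the defaults shown in ADR-0001's identityDefaults example (hostname from the namespace, username from namespace plus config name) and uses plain string formatting in place of tera; IdentityOverrides, ResolvedIdentity, and resolve_identity are hypothetical names.

```rust
use std::fmt;

/// Optional per-component overrides, as a user might set them on a recipe.
#[derive(Default)]
pub struct IdentityOverrides {
    pub username: Option<String>,
    pub hostname: Option<String>,
}

/// The fully resolved identity that would be pinned into status.
#[derive(Debug, PartialEq, Eq)]
pub struct ResolvedIdentity {
    pub username: String,
    pub hostname: String,
    pub path: String,
}

impl fmt::Display for ResolvedIdentity {
    /// Formats as kopia's `username@hostname:path`.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}@{}:{}", self.username, self.hostname, self.path)
    }
}

/// Assumed defaults: hostname = namespace, username = "<namespace>-<config>".
/// Any component present in `overrides` wins over its default.
pub fn resolve_identity(
    namespace: &str,
    config_name: &str,
    path: &str,
    overrides: &IdentityOverrides,
) -> ResolvedIdentity {
    ResolvedIdentity {
        username: overrides
            .username
            .clone()
            .unwrap_or_else(|| format!("{}-{}", namespace, config_name)),
        hostname: overrides
            .hostname
            .clone()
            .unwrap_or_else(|| namespace.to_string()),
        path: path.to_string(),
    }
}
```

Returning an owned value rather than a template reference mirrors the rule that the identity is rendered once at admission and never re-rendered.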
4.3 Repository sharing
See ADR-0001 §4.3.
4.4 Retention enforcement
See ADR-0001 §4.4. GFS-only. Failed-job-count is a separate flat bound on the BackupSchedule.
4.5 Backup deletion semantics
See ADR-0001 §4.5. Adopted in full. This is one of the two big places where the onedr0p draft is meaningfully better than bo0tzz's: defaulting to "deleting the CR deletes the snapshot" matches user expectations established by Kubernetes finalizer semantics elsewhere (e.g. deleting a PersistentVolumeClaim also deletes the underlying volume when the bound PersistentVolume's persistentVolumeReclaimPolicy is Delete).
deletionPolicy enum:
#[derive(Serialize, Deserialize, Clone, Copy, Debug, JsonSchema, PartialEq, Eq)]
pub enum DeletionPolicy {
    /// Default for `origin: scheduled` and `origin: manual`. Finalizer runs
    /// `kopia snapshot delete <id>` then removes the finalizer.
    Delete,
    /// Default for `origin: discovered`. CR is removed; snapshot stays.
    /// Forced via webhook for discovered backups; cannot be overridden.
    Retain,
    /// CR is removed without contacting the repository at all (escape hatch
    /// for "the bucket is gone, just let me delete the CR"). Status records
    /// `orphaned: true` for the snapshot ID before removal.
    Orphan,
}
The reconciler distinguishes the three cases with an exhaustive match — Rust enforces that any new variant added later must be handled in every match site, preventing the class of bug where a new policy slips into production without a corresponding reconcile branch.
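A minimal illustration of that exhaustive match follows. It uses a std-only copy of DeletionPolicy (no serde or schemars derives) and a hypothetical FinalizerStep type to describe what the finalizer would do; neither the type nor the function name comes from the project.

```rust
/// Std-only copy of the enum above, without the serde/schemars derives.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DeletionPolicy {
    Delete,
    Retain,
    Orphan,
}

/// What the finalizer should do before releasing the CR (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum FinalizerStep {
    /// Run `kopia snapshot delete <id>` first.
    DeleteSnapshot { snapshot_id: String },
    /// Leave the snapshot in the repository.
    KeepSnapshot,
    /// Skip the repository entirely and record the snapshot as orphaned.
    MarkOrphaned { snapshot_id: String },
}

pub fn finalizer_step(policy: DeletionPolicy, snapshot_id: &str) -> FinalizerStep {
    // No wildcard arm: adding a fourth variant to DeletionPolicy makes this
    // match fail to compile until the new case is handled.
    match policy {
        DeletionPolicy::Delete => FinalizerStep::DeleteSnapshot {
            snapshot_id: snapshot_id.to_string(),
        },
        DeletionPolicy::Retain => FinalizerStep::KeepSnapshot,
        DeletionPolicy::Orphan => FinalizerStep::MarkOrphaned {
            snapshot_id: snapshot_id.to_string(),
        },
    }
}
```

The guarantee only holds as long as match sites avoid a catch-all `_` arm, which is a convention the codebase would have to keep.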
4.6 Restore resolution & semantics
See ADR-0001 §4.6. Three source modes (backupRef, fromConfig, identity) are an enum, validated at admission, pinned at status.
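The fail-closed default from §2.2 can be expressed as a small planning step that runs after the source has been resolved. This is a sketch with assumed names: only onMissingSnapshot: Continue appears in the text, so the Fail variant, RestorePlan, and plan_restore are hypothetical.

```rust
/// Behaviour when the source resolves to no snapshot. `Fail` stands for the
/// fail-closed default; `Continue` is the deploy-or-restore opt-in.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum OnMissingSnapshot {
    #[default]
    Fail,
    Continue,
}

/// Outcome of planning a restore (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum RestorePlan {
    RestoreSnapshot(String),
    LeaveVolumeEmpty,
}

/// `resolved` is the snapshot ID the source mode resolved to, if any.
pub fn plan_restore(
    resolved: Option<&str>,
    on_missing: OnMissingSnapshot,
) -> Result<RestorePlan, String> {
    match (resolved, on_missing) {
        (Some(id), _) => Ok(RestorePlan::RestoreSnapshot(id.to_string())),
        // Explicit opt-in: a fresh cluster with no snapshots starts empty.
        (None, OnMissingSnapshot::Continue) => Ok(RestorePlan::LeaveVolumeEmpty),
        // Default: never hand the workload a silently empty PVC.
        (None, OnMissingSnapshot::Fail) => {
            Err("no snapshot matched the restore source".to_string())
        }
    }
}
```

Making the empty-volume outcome a distinct variant keeps it visible in status instead of looking like a successful restore.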
4.7 Volume populator
See ADR-0001 §4.7.
4.8 Hooks
See ADR-0001 §4.8. Hooks run in the workload pod via kubectl exec-equivalent (the controller uses kube::api::AttachParams), not in the mover. Resolves G13.
4.9 Multi-PVC consistency
See ADR-0001 §4.9.
4.10 Mover pods & failure handling
The mover is a statically-linked Rust binary built with --target x86_64-unknown-linux-musl (and aarch64-unknown-linux-musl for ARM). It is roughly 8 MB. The container image is built on gcr.io/distroless/static-debian12:nonroot plus the kopia binary from the official release, totaling ~70 MB.
The mover binary:
- Reads its work spec from a downward-API-mounted JSON file (the controller writes a ConfigMap per Backup/Restore run with the resolved identity, paths, hook plan, options).
- Invokes kopia --json and streams output through a serde_json Deserializer::from_reader.
- Reports progress every 5 s via a PATCH to the Backup.status subresource using kube::Api::patch_status.
- On terminal failure, writes a structured status.failure block (kopia error class, last stderr lines, retry recommendation) and exits non-zero.
Image size, startup time (<200 ms cold start in a fresh pod), and memory footprint (resident ~12 MB before kopia subprocess) are all materially better than the Go equivalent. None of those are decisive for a backup workload, but they make the operator cheap to colocate on a small cluster — which matters for the homelab/SMB segment the project is targeted at.
backoffLimit and activeDeadlineSeconds are passed through to the Job template. Mover pods carry a finalizer that the controller clears once status is read; this fixes G18 (zombie pods).
4.11 Forward compatibility
See ADR-0001 §4.13. Same sub-object discipline.
4.12 Security & RBAC
See ADR-0001 §4.11. Controller RBAC is generated from the kube-rs Resource traits with a build-time cargo xtask gen-rbac task. Mover pods use a per-namespace ServiceAccount minted by the controller with PVC-read or PVC-write scoped to the specific PVC name; no namespace-wide PVC permissions.
4.13 Observability
See ADR-0001 §4.10. Metrics emitted via prometheus crate. Controller exports the standard kube-rs reconcile metrics (controller_reconciliations_total, controller_reconcile_duration_seconds) plus per-CRD business metrics (kopia_backup_bytes_total, kopia_repo_size_bytes, kopia_maintenance_last_success_timestamp_seconds).
Tracing via tracing + tracing-subscriber with OTLP export. Every reconcile is a single tracing::Span keyed by (kind, namespace, name, generation); child spans cover kopia subprocess invocations, webhook calls, and finalizer steps.
5. Implementation surface (Rust-specific)
This is the section that doesn't exist in either predecessor ADR. It's the load-bearing reason we're choosing Rust over Go — if these claims don't hold up, we should revisit the language choice before writing more code.
5.1 Crate layout
kopiur/
├── Cargo.toml # workspace
├── crates/
│ ├── api/ # CRD types, JsonSchema derive, no controller deps
│ ├── controller/ # reconcilers, owned-resource indexes, finalizers
│ ├── webhook/ # axum admission webhook server
│ ├── mover/ # the per-Backup/Restore Job binary
│ └── xtask/ # codegen: CRD YAML, RBAC YAML, helm values schema
└── deploy/
    ├── crds/ # generated, checked in
    ├── helm/ # operator + webhook + namespace install
    └── examples/
Splitting api from controller matters more than it sounds: downstream Rust users (a custom backup-triggering controller, a CI tool that lints BackupConfig manifests) can take a dependency on kopiur-api without pulling in tokio, kube::Client, or any of the controller runtime. This is the Rust equivalent of Kubernetes Go's apimachinery-vs-controller-runtime split, and kube-rs makes it natural.
5.2 Controller runtime
kube::runtime::Controller per top-level CRD (Repository, ClusterRepository, BackupConfig, Backup, BackupSchedule, Restore, Maintenance). Each Controller owns its Api<T> plus an owned-resource watch for the types it manages:
- BackupSchedule watches Backup (owner-ref) to recompute next-scheduled-slot from status.lastSuccessfulBackup.
- BackupConfig watches Backup to enforce GFS retention.
- Repository / ClusterRepository watches Backup of origin: discovered to materialize/expire catalog rows.
- Restore watches the target PVC for populator handshake completion.
Reconcile errors return kube::runtime::controller::Action::requeue(duration) with exponential backoff (clamped at 5 minutes). The error_policy closure logs the error, increments controller_reconcile_errors_total, and chooses requeue interval based on error kind (transient kopia / API server / webhook outage → 30 s; structural CRD bug → 5 min).
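The requeue arithmetic can be sketched with std types alone. How the per-kind base interval and the exponential backoff combine is not spelled out above, so the doubling rule here is an assumption, as are the names ErrorKind, base_interval, and requeue_after; in the controller the resulting Duration would be handed to Action::requeue.

```rust
use std::time::Duration;

/// Coarse error classes named in the text above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ErrorKind {
    /// Transient kopia, API server, or webhook outage.
    Transient,
    /// Structural problem that a quick retry will not fix.
    Structural,
}

const MAX_BACKOFF: Duration = Duration::from_secs(5 * 60);

/// Base interval per error kind: 30 s for transient, 5 min for structural.
pub fn base_interval(kind: ErrorKind) -> Duration {
    match kind {
        ErrorKind::Transient => Duration::from_secs(30),
        ErrorKind::Structural => MAX_BACKOFF,
    }
}

/// Doubles the base interval per consecutive failure, clamped at 5 minutes.
/// `attempt` is 0 for the first retry. Saturating arithmetic keeps very
/// large attempt counts from overflowing.
pub fn requeue_after(kind: ErrorKind, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base_interval(kind).saturating_mul(factor).min(MAX_BACKOFF)
}
```

With these numbers a transient error retries after 30 s, 60 s, 120 s, 240 s and then settles at the 5-minute ceiling.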
5.3 Webhook
axum 0.7 on tokio with rustls. Certificate management via cert-manager Certificate CR (helm chart provisions it). The webhook handler is one async function per resource that calls into kopiur-api's validators — same code path the controller would use to sanity-check before reconcile, so behavior is consistent.
5.4 kopia interaction
Subprocess via tokio::process::Command. JSON output streamed line-by-line (tokio::io::BufReader::lines) and parsed with serde_json::from_str into kopia-defined types (kopia-cli-types sub-crate; manually maintained against kopia's stable CLI JSON output, regenerated when kopia releases new fields).
Long-running snapshot/restore subprocesses are managed by the mover pod, not the controller. The controller never spawns kopia directly except for short, idempotent operations: kopia repository connect --json to validate a Repository, kopia snapshot list --json to materialize the catalog. These run as short-lived Jobs, not in-process, so a controller restart doesn't strand a kopia process.
5.5 Why Rust, concretely
This is the section that has to justify Rust over Go for a maintainer reading the ADR cold. The candid version:
| Property | Rust + kube-rs | Go + controller-runtime |
|---|---|---|
| Discriminated-union safety | Native (enum, exhaustive match). Compile-time guarantee that every variant is handled. | Tagged structs + oneof validation in webhook. Runtime check only. |
| Memory footprint (controller) | ~50 MB resident at idle in profiling builds | ~150 MB typical for a controller-runtime binary at idle |
| Mover image size | ~70 MB (distroless + kopia + 8 MB Rust binary) | ~120 MB (distroless + kopia + 35 MB Go binary) |
| Ecosystem maturity | kube-rs is production-grade and used by CRI-O, Stackable, Linkerd's Rust components, the Volsync rust-mover-shim experiments | controller-runtime is older, larger, more battle-tested across a wider population |
| Hiring pool | Smaller. Notable. | Larger. |
| CRD codegen | schemars derive produces JSON Schema directly from the spec struct | controller-gen does the same; both work; Rust's is slightly tighter (no +kubebuilder: magic comments) |
| Reconciler ergonomics | async fn reconcile with ? operator for error propagation, tokio::select! for cancellation | Function returning (Result, error); cancellation via context.Context |
| Test ergonomics | kube::Client::try_default() against kind + serial_test; first-class Mock clients via tower::ServiceExt | Same pattern via envtest; arguably more mature |
The hiring-pool concern is real but the project's likely contributor base (the kubesearch / homelab-ops / self-hosted-k8s community that's already running the perfectra1n volsync fork) skews higher Rust-literate than typical, and the maintainer set this project would actually ship with is one or two people, not a team of ten.
The exhaustiveness guarantee is the load-bearing argument. Backup software has the highest "wrong answers are catastrophic" coefficient of any controller class — a controller that silently does nothing because a new enum variant slipped past a switch block can lose user data. Rust prevents that class of bug at the type level; Go cannot.
6. Usage walkthroughs
See ADR-0001 §5 — all walkthroughs there apply unchanged. The CRDs are language-agnostic; only the controller binary is Rust.
The walkthroughs in ADR-0001 §5.1 through §5.9 cover:
- Single PVC, scheduled daily
- Shared platform repository (ClusterRepository)
- Restore by picking a backup
- Multi-PVC selector
- Deploy-or-restore (GitOps)
- Manual one-shot backup
- Restore from a discovered (foreign / pre-install) backup
- Forcing CR removal when the repo is offline
- Suspending a schedule via GitOps
7. Consequences
7.1 Positive
- Single mover, native CRDs — no abstraction tax (G19).
- Repository as a Kubernetes resource (G1); cluster-scoped option for platform teams.
- Trigger separation (G17) unlocks Argo Events / Helm hook / kubectl create paths.
- GFS retention surfaced at the recipe level; failures bounded separately (G6).
- Fail-closed restore default (G7) with explicit deploy-or-restore opt-in.
- Discoverable snapshot catalog (G8/G9) — restores are "pick a row," not "construct a timestamp."
- Backup CR lifecycle owns kopia snapshot lifecycle by default (G20); discovered snapshots cannot be deleted by the operator (G20 + safety).
- Maintenance is a first-class CRD (G11) with an explicit ownership lease.
- Rust controller surface gives compile-time exhaustiveness on enum handling (G21) — the single largest class of "controller silently drops data" bug becomes impossible.
- Lower resource footprint than a Go equivalent matters at the homelab/SMB tier the project is aimed at.
7.2 Negative / trade-offs
- Larger blast radius if a controller bug ships: deletionPolicy: Delete is the default for produced backups, so a buggy GC could delete real snapshots. Mitigated by: (a) finalizer-mediated deletion only after status validates; (b) discovered backups forced to Retain; (c) kopia maintenance separates content from manifests, so a deleted snapshot is recoverable until the next full maintenance.
- Webhook is in the failure path of every CR write. Failure-mode is "fail-closed" via failurePolicy: Fail for safety-critical fields and Ignore for soft validators.
- Identity-model exposure is more upfront learning than volsync's "you don't need to know." Acceptable cost: kopia's identity model is the operator's defining shape.
- Rust hiring pool is smaller than Go's. Acceptable for a project of this size and contributor profile.
- kube-rs ecosystem, while production-grade, has fewer "I'll grab a snippet from Stack Overflow" answers than controller-runtime. Documentation discipline matters more.
- ClusterRepository adds one more concept to learn. We accept this because the shared-repo use case is real and important (platform teams running a backup tier across many tenant namespaces).
8. Deferred / open questions
- Cron library choice. croner vs tokio-cron-scheduler vs hand-rolled. Decision deferred to a follow-up ADR once webhook is feature-complete and we know the actual schedule volume per cluster.
- CRD versioning strategy. v1alpha1 → v1beta1 → v1 cadence. Conversion webhooks via kube-rs are supported but unergonomic; we may pin v1alpha1 for longer than typical and bundle breaking changes into a single v1beta1 cutover.
- Multi-tenant maintenance scheduling. When two BackupConfigs in different namespaces share a ClusterRepository, who owns the maintenance lease? Current proposal: a single Maintenance CR in the kopia-system namespace per ClusterRepository, written by the platform admin. Open to alternatives.
- Restic interop / migration tooling. Out of scope for v1alpha1. Likely a one-shot kopiur-migrate binary, not a CRD.
- Status subresource bandwidth. Mover pods reporting progress every 5 s via PATCH could be heavy on large clusters. Defer optimization to v1beta1 if metrics show it matters.
9. References
Predecessor ADRs (in this repo)
- docs/adr/0001-onedr0p-kopia-operator.md: fuller draft with ClusterRepository, deletion semantics, GFS-driven retention. This ADR adopts its CRD surface wholesale.
- docs/adr/0002-bo0tzz-kopia-operator.md: leaner draft, 5 CRDs, simpler retention. This ADR keeps its anchoring-principles clarity and CRD-count-first framing.
External
- backube/volsync: upstream
- perfectra1n/volsync: kopia fork
- backube/volsync#1723: kopia mover PR
- backube/volsync#1559: trigger redesign
- kube-rs/kube: implementation framework
- Kopia documentation: repository model, identity, maintenance
- CloudNativePG: Cluster / ScheduledBackup / Backup separation
- Tekton: Task / TaskRun separation
Appendix A: Field-by-field comparison vs volsync
See ADR-0001 Appendix A. No changes — comparison is between CRD shapes, not implementation language.
ADR-0001: A Kopia-Native Backup Operator for Kubernetes
- Status: Proposed
- Date: 2026-05-24
- Inspired by: backube/volsync and the kopia fork perfectra1n/volsync (especially PR backube/volsync#1723 and the trigger-redesign proposal backube/volsync#1559). The triggering model also draws on CloudNativePG (Cluster / ScheduledBackup / Backup) and Tekton (Task / TaskRun).
Scope: this ADR covers CRD shape, user experience, and high-level design choices. It deliberately does not specify Go package layout, controller-runtime indexes, leader-election lease IDs, the cron library, or other implementation mechanics — those belong to follow-up ADRs once the API surface is agreed.
1. Context
VolSync is the de-facto Kubernetes-native mover for PVCs. Its design is mature and battle-tested, but it has accreted around restic's model. As soon as you try to add a non-restic mover (kopia, rustic, borg, …) several deep design choices push back. The community fork perfectra1n/volsync proves out a kopia mover and ships a usable image — but its PR has been open ~13 months without merging, the upstream maintainers are capacity-constrained, and many users have switched to running the fork in production.
The fork's existence and the volume of feature requests around kopia/restic locking, multi-PVC backup, scheduling jitter, restore UX, trigger separation, snapshot lifecycle, and "stop running on apply" suggest something stronger than "land kopia in volsync" is warranted. A kopia-native operator can:
- Drop the multi-mover abstraction entirely. Kopia is the only mover, so every CRD field can be expressive without leaking through a generic shape.
- Make a repository a first-class Kubernetes resource — at both namespace and cluster scope. Kopia repos are designed to be shared across many writers, including across namespaces.
- Separate recipe, invocation, and schedule so backups can be triggered by any source (cron, kubectl create, Argo Events, button-in-Grafana). Volsync's trigger field couples all three.
- Use kopia's native identity model (username@hostname:path) deliberately rather than as an accident of metadata.name / metadata.namespace.
- Treat kopia maintenance and snapshot lifecycle as first-class operator concerns rather than retrofits.
- Tie the lifecycle of a Kopia snapshot to the lifecycle of its Backup CR by default, with explicit opt-outs. This addresses the persistent volsync confusion that deleting a ReplicationSource has no effect on snapshots in the repository.
- Surface kopia's snapshot catalog through CRDs so restore is "browse and reference," not "construct a restoreAsOf timestamp and hope."
- Address the long backlog of papercuts as design decisions, not bug fixes.
We refer to the project as kopia-operator in this document; final naming is out of scope. The API group is kopia.io with initial version v1alpha1.
1.1 The most important gaps we are addressing
| # | Gap | volsync refs |
|---|---|---|
| G1 | Repository is not a Kubernetes resource; cannot be shared/reused cleanly | implicit; perfectra1n CRD shape |
| G2 | One ReplicationSource = one PVC | #1115, #1116, #320 |
| G3 | First reconcile triggers an immediate backup, no GitOps-friendly "skip first run" | #627 |
| G4 | No cron jitter / H substitution, no timezone | #1421, #702 |
| G5 | Restic repo locking / piling-up jobs | #1042, #1429, #646 |
| G6 | No retry-limit / backoffLimit override | #1228, #1042 |
| G7 | Restore proceeds with empty PVC if no snapshot found | #1211 |
| G8 | Snapshot selection is restic-format restoreAsOf only; no browse | #7, #1211 |
| G9 | latestImage always wins — no immutable restore source | disc #1115 |
| G10 | Volume populator + Direct copyMethod incompatibility | disc #1115, #1129 |
| G11 | Maintenance ownership is implicit & runs in the same pod as backup | perfectra1n fork redesigned this three times |
| G12 | Policy passthrough is brittle: every kopia knob needs CRD/jq script changes | fork #13, #23 |
| G13 | Snapshot actions run in mover, not workload | fork #22 |
| G14 | OOMs unpredictable; no resource guidance | #626, #707, #1228 |
| G15 | Mover image is :latest by default | volsync restic/builder.go:42 |
| G16 | Restricted PSA / OpenShift SCC / unprivileged-mode lost+found papercuts | #367, #1033, #1889, #1430 |
| G17 | Trigger semantics are baked into the source CR — no manual/external trigger path | #1559 |
| G18 | Mover-pod lifecycle (zombie pods, stuck jobs) | fork #8, volsync #1415 |
| G19 | Maintainers' explicit door-closing on new movers | #1743, #1029, #320 |
| G20 | Deleting the source CR doesn't delete snapshots from the repository | implicit |
2. Decision
2.1 Topology
Seven CRDs in kopia.io/v1alpha1. Six are namespaced; ClusterRepository is cluster-scoped.
| CRD | Scope | Layer | Purpose |
|---|---|---|---|
| Repository | Namespaced | Storage | A kopia repository owned by one namespace: credentials, backend, encryption, optional catalog-materialization bounds. Many BackupConfigs/Restores reference one. |
| ClusterRepository | Cluster | Storage | A shared kopia repository operated by the platform team, referenceable from allow-listed namespaces. Identity defaults are templated per consumer namespace. |
| BackupConfig | Namespaced | Recipe | What to back up: PVC selector, identity, retention, policy, hooks. Idempotent: doesn't run anything on its own. |
| Backup | Namespaced | Invocation + Catalog | A single kopia snapshot as a Kubernetes object. Created by BackupSchedule, kubectl create, or any other trigger source. Also materialized by the operator from the kopia catalog for snapshots it didn't produce (foreign or pre-install). |
| BackupSchedule | Namespaced | Cron | When it runs: cron (with jitter + timezone) + configRef. Creates Backup CRs. |
| Restore | Namespaced | Operation | A restore from a snapshot/identity to a PVC. Used directly, or referenced by PVC.spec.dataSourceRef. |
| Maintenance | Namespaced | Lifecycle | One per Repository/ClusterRepository: schedules kopia maintenance run quick + full, manages ownership lease. |
The three-layer split (recipe / invocation / schedule) for backups is the deliberate response to volsync #1559. It means:
- A Backup can be created from anywhere: kubectl create, Argo Events, a Tekton pipeline, a webhook handler.
- A BackupSchedule is just one source of Backup CRs. Removing or pausing a schedule does not affect already-running or already-completed runs.
- A BackupConfig change applies to subsequent invocations; the operator snapshots resolved values into each Backup.status.resolved... for traceability.
Backup is also the single canonical representation of a kopia snapshot — both ones we produced and ones we discover in the repo. Three retention drivers cover the lifecycle:
- BackupConfig.spec.retention (GFS: keepLatest / keepHourly / keepDaily / ...) is the primary mechanism. The operator periodically computes the retention set for each (BackupConfig identity, source) tuple and deletes Backup CRs outside it. Each deleted CR's deletionPolicy determines whether the underlying kopia snapshot goes with it. Details in §4.4.
- BackupSchedule.spec.failedJobsHistoryLimit bounds failed Backup CRs from a schedule (GFS doesn't apply to failures).
- Repository.spec.catalog.retain / ClusterRepository.spec.catalog.retain bounds the origin: discovered Backup CR set, keeping etcd footprint sane for large repos. Discovered Backups always have deletionPolicy: Retain so this never deletes real snapshots (§4.5).
Manual Backup CRs (origin: manual with no schedule parent) are user-owned and not auto-GC'd; their snapshots are tied to their deletionPolicy.
Dedup key is (Repository.UID, kopiaSnapshotID) — the operator will not create a discovered Backup for a snapshot already represented by an operator-initiated one.
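The dedup rule amounts to a set lookup keyed on the pair. A minimal sketch, assuming the existing Backup CRs have already been indexed into a set; SnapshotKey and snapshots_to_materialize are hypothetical names.

```rust
use std::collections::HashSet;

/// Dedup key from the text: (Repository.UID, kopiaSnapshotID).
pub type SnapshotKey = (String, String);

/// Returns the snapshot IDs from a catalog scan that still need a
/// discovered Backup CR, skipping any already represented by an existing CR
/// for the same repository.
pub fn snapshots_to_materialize(
    repo_uid: &str,
    catalog_snapshot_ids: &[&str],
    existing: &HashSet<SnapshotKey>,
) -> Vec<String> {
    catalog_snapshot_ids
        .iter()
        .copied()
        .filter(|id| !existing.contains(&(repo_uid.to_string(), id.to_string())))
        .map(str::to_string)
        .collect()
}
```

Including the repository UID in the key means the same snapshot ID seen in a different repository is still treated as a separate snapshot.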
Restore stays as a single CR (it's an operation, not a recurring thing). For the dataSourceRef-driven populator pattern, a Restore is left in passive mode (no target) and consumed by zero-or-more PVCs.
2.2 Anchoring principles
- Repositories are objects, at both namespace and cluster scope. Identity, lifecycle, maintenance, and tenancy gating hang off them.
- Triggering is decoupled. BackupConfig says what; Backup says that; BackupSchedule says when. Any of the three can be authored or automated independently.
- A Backup is a kopia snapshot. Operator-initiated, manually-applied, and discovered snapshots are all the same kind.
- A Backup CR owns the lifecycle of its kopia snapshot by default. Deleting the CR deletes the snapshot from the repository, governed by a deletionPolicy field. Discovered backups are forced to Retain so the operator never deletes data it didn't create.
- Restores are explicit. No silent "empty PVC because no snapshots existed yet" by default. The "deploy-or-restore" GitOps pattern is opt-in via a specific source mode + onMissingSnapshot: Continue.
- Maintenance is a first-class lifecycle concern, with its own CRD and explicit ownership lease.
- The mover is a thin shim. A Go-native controller invokes kopia --json and parses results. No 2,500-line bash scripts. The image carries kopia and nothing else.
- Validation is webhook-enforced. Mutually exclusive fields, missing repository references, malformed schedules, cross-tenant references: all rejected at admission.
- Identity is explicit and overridable. Defaults derive from object name/namespace; every component is overridable; the resolved identity always appears in status.
- Forward-compatible by construction. Every credential, policy, and rotation surface is a sub-object, so future fields slot in without API breakage (see §4.13).
2.3 Where Backup CRs live
| Origin | Namespace |
|---|---|
| operator (created by BackupSchedule) | The BackupConfig's namespace (so the owning team sees their backups with kubectl get backup -n <team>). |
| manual (created by kubectl create or external automation) | Whichever namespace the user applies it to. The configRef may cross namespaces, subject to RBAC. |
| discovered (materialized from the kopia catalog) | The Repository's namespace or, for snapshots discovered under a ClusterRepository, the namespace named in the snapshot's identity, if it exists and is in the allowedNamespaces set. Falls back to a configurable catalog.fallbackNamespace otherwise. |
Restore.spec.source.backupRef carries { name, namespace } for cross-namespace references.
3. CRD Design
3.1 Repository
Owns credentials, encryption, and repository-wide settings for a single namespace. Catalog materialization for discovered Backup CRs is configured here.
apiVersion: kopia.io/v1alpha1
kind: Repository
metadata:
  name: nas-primary
  namespace: backups
spec:
  # Exactly one backend block. Webhook-enforced.
  backend:
    s3:
      bucket: my-backups
      prefix: clusters/prod/
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
      auth:
        secretRef:
          name: nas-primary-creds  # keys: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, ...
        # Optional advanced auth: workloadIdentity supported but not the default.
        # workloadIdentity:
        #   serviceAccountName: kopia-s3
      tls:
        caBundleRef:
          configMapName: corp-ca
          key: ca.crt
        insecureSkipVerify: false
  encryption:
    passwordSecretRef:  # always a Secret ref; never inline
      name: nas-primary-creds
      key: KOPIA_PASSWORD
    # Future fields (rotation, previousPasswords, ...) slot in here.
  create:
    enabled: true  # if repo missing, create it
    encryption: AES256-GCM-HMAC-SHA256
    splitter: DYNAMIC-4M-BUZHASH
    hash: BLAKE3-256
  cacheDefaults:  # inherited by Backup/Restore unless overridden
    capacity: 8Gi
    storageClassName: fast-ssd
    metadataCacheSizeMB: 5000
    contentCacheSizeMB: 2000
  catalog:  # bounds materialization of `origin: discovered` Backup CRs
    retain:
      perIdentity: 100  # most recent N per username@hostname:path
      maxAgeDays: 90  # nothing older than this gets a Backup CR
    refreshInterval: 5m
    # Older snapshots remain in kopia; restorable via Restore.source.identity.snapshotID
status:
  phase: Ready  # Pending | Initializing | Ready | Degraded | Failed
  observedGeneration: 7
  uniqueID: "fb6e...c41a"  # kopia repo unique ID
  conditions:
    - type: Connected
      status: "True"
      reason: ConnectFromConfig
    - type: MaintenanceOwned
      status: "True"
      message: "kopia-operator/nas-primary"
  storageStats:
    snapshotCount: 1284
    totalSize: 412Gi
    lastObservedAt: 2026-05-24T17:00:01Z
  catalog:
    discoveredBackupCount: 412  # how many Backup CRs materialized from the catalog scan
    lastRefreshAt: 2026-05-24T17:01:11Z
Why: addresses G1 (repo as a resource), G15 (digest pinning belongs on the operator image, not embedded per recipe), and provides the catalog-bounds knob that keeps Backup CRs from blowing up etcd while still giving the K8s-native view of kopia history. encryption is a sub-object so future rotation fields fit without API breakage (§4.13).
3.2 ClusterRepository
The cluster-scoped counterpart for shared infrastructure repositories operated by a platform team. Same spec surface as Repository, plus tenancy gating and per-namespace identity templating.
apiVersion: kopia.io/v1alpha1
kind: ClusterRepository
metadata:
  name: shared-primary
spec:
  # Same backend/encryption/cacheDefaults/create/catalog blocks as Repository.
  backend:
    s3:
      bucket: org-kopia-repo
      prefix: ""  # bucket root maximizes dedup across tenants
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
      auth:
        secretRef:
          name: kopia-platform-creds
          namespace: kopia-system  # REQUIRED on cluster-scoped CRs
  encryption:
    passwordSecretRef:
      name: kopia-platform-creds
      namespace: kopia-system  # REQUIRED
      key: KOPIA_PASSWORD
  create:
    enabled: true
    encryption: AES256-GCM-HMAC-SHA256
  # Tenancy gate: webhook-enforced on every consumer CR.
  allowedNamespaces:
    # Exactly one of:
    list: [production, staging, billing]
    # selector:
    #   matchLabels: { kopia.io/tier: enterprise }
    # all: true
  # Identity defaults applied when consumers don't override.
  identityDefaults:
    hostnameTemplate: "{{ .Namespace }}"
    usernameTemplate: "{{ .Namespace }}-{{ .ConfigName }}"
  catalog:
    retain:
      perIdentity: 50
      maxAgeDays: 60
    refreshInterval: 5m
    # Where to materialize discovered Backup CRs whose identity hostname
    # does not match an allowed namespace.
    fallbackNamespace: kopia-system
status:
  phase: Ready
  uniqueID: "0a91...8a3f"
  allowedNamespaceCount: 3
  conditions:
    - type: Connected
      status: "True"
    - type: TenancyEnforced
      status: "True"
Consumer CRDs (BackupConfig, Backup, Restore, Maintenance) accept a discriminated repository reference:
repository:
  kind: ClusterRepository  # Repository (default) | ClusterRepository
  name: shared-primary
  # namespace: ...  # ignored when kind=ClusterRepository
The validating admission webhook rejects a consumer CR whose namespace is not in the ClusterRepository.spec.allowedNamespaces set. This avoids the "secret accessible from any namespace" anti-pattern and gives platform teams a single object tenants can't shadow.
Why: the cross-namespace Repository ref pattern covers most cases but has two real shortcomings — tenants can create their own Repository with the same name as a platform one (no shadow protection), and tenancy is expressed in RBAC rules rather than as a first-class allow list. ClusterRepository fixes both. The shared-prefix backend layout (prefix: "") also maximizes deduplication across all tenant namespaces, which is the operational reason platform teams want a shared repo in the first place.
3.3 BackupConfig
The recipe. Idempotent. Apply once; reference from many Backups or one BackupSchedule.
apiVersion: kopia.io/v1alpha1
kind: BackupConfig
metadata:
  name: postgres-data
  namespace: billing
spec:
  repository:
    kind: Repository  # Repository | ClusterRepository
    name: nas-primary
    namespace: backups  # cross-ns Repository; ignored for ClusterRepository
  # Identity: what kopia sees. Defaults shown.
  # For ClusterRepository consumers, the repository's identityDefaults templates apply
  # unless overridden here.
  identity:
    username: "postgres-data"  # default: <BackupConfig.metadata.name>
    hostname: "billing"  # default: <BackupConfig.metadata.namespace>
  # Sources: what to back up.
  sources:
    - pvc: { name: postgres-data }
      sourcePathOverride: /data  # what kopia records (default: /pvc/<name>)
    # Or a selector for multi-PVC:
    # - pvcSelector:
    #     namespaceSelector: { matchNames: [billing, billing-staging] }
    #     labelSelector: { matchLabels: { backup: include } }
    #   sourcePathStrategy: PVCName  # PVCName | PVCNamespacedName
  copyMethod: Snapshot  # Snapshot (default, PiT) | Clone | Direct
  volumeSnapshotClassName: csi-snap-class
  groupBy: VolumeGroupSnapshot  # default for multi-PVC sources; None opts into per-PVC
  retention:  # GFS, enforced by operator pruning Backup CRs (§4.4)
    keepLatest: 10
    keepHourly: 24
    keepDaily: 14
    keepWeekly: 8
    keepMonthly: 12
    keepAnnual: 5
  # Default deletion policy for Backup CRs created against this config.
  # Per-Backup override available on Backup.spec.deletionPolicy.
  defaultDeletionPolicy: Delete  # Delete | Retain | Orphan
  policy:  # typed fields, not opaque JSON parsed by jq
    compression:
      compressor: zstd
      neverCompress: ["*.zip", "*.gz", "*.mp4"]
    splitter: DYNAMIC-4M-BUZHASH
    ignore:
      paths: ["*.tmp", "*/cache/*", "lost+found"]
      cacheDirs: true  # honor CACHEDIR.TAG
    ignoreIdenticalSnapshots: true  # fork issue #13
    extraArgs: []  # escape hatch for kopia flags we don't model yet
  hooks:  # G13: runs in the workload, not the mover
    beforeSnapshot:
      - workloadExec:
          podSelector: { matchLabels: { app: postgres } }
          container: postgres
          command: ["pg_start_backup", "snap"]
          timeout: 2m
    afterSnapshot:
      - workloadExec:
          podSelector: { matchLabels: { app: postgres } }
          container: postgres
          command: ["pg_stop_backup"]
          timeout: 2m
  mover:  # per-recipe overrides
    resources:
      requests: { cpu: 250m, memory: 512Mi }
      limits: { cpu: "2", memory: 4Gi }
    cache:
      capacity: 16Gi
      storageClassName: fast-ssd
    securityContext: {}  # override; default: nonRoot uid 65534
    # privilegedMode: true  # opt-in, namespace-gated; preserves UID/GID on restore
    # inheritSecurityContextFrom:  # opt-in: copy SC from a live workload pod
    #   podSelector: { matchLabels: { app: postgres } }
status:
  resolved:  # what would be passed to kopia
    identity:
      username: "postgres-data"
      hostname: "billing"
    sources:
      - pvc: billing/postgres-data
        sourcePath: /data
  retention:
    activeBackupCount: 47  # CRs currently inside the GFS window
    lastPruneAt: 2026-05-24T03:00:00Z
    lastPruneDeleted: 2
  conditions:
    - type: RepositoryReachable
      status: "True"
    - type: GroupSnapshotSupported
      status: "True"
Why: addresses G2 (selector + VolumeGroupSnapshot default), G12 (typed policy + escape hatch), G13 (hook types), G14 (explicit resource defaults), G16 (security-context controls without forcing privileged-by-default). The identity sub-object makes the second-biggest perfectra1n papercut (fork #7) impossible.
3.4 Backup
A single kopia snapshot as a Kubernetes object. Three origins:
- operator: created by a BackupSchedule. Spec has configRef; lives in the BackupConfig's namespace.
- manual: created by kubectl create or external automation. Spec has configRef; lives wherever the user applied it.
- discovered: materialized by the operator's catalog scan for snapshots it didn't produce. Spec is empty/absent; lives in the Repository's namespace (see §2.3).
apiVersion: kopia.io/v1alpha1
kind: Backup
metadata:
  name: postgres-data-20260524-021300
  namespace: billing
  finalizers:
    - kopia.io/snapshot-cleanup  # §4.5
  labels:
    # Operator-managed labels: canonical values live in status; these mirror
    # for kubectl-selectability.
    kopia.io/repository: nas-primary
    kopia.io/backup-config: postgres-data
    kopia.io/origin: operator
    kopia.io/identity-hash: "a3f1..."
spec:
  # Operator-initiated and manual: configRef + optional overrides.
  # Discovered: spec is empty/absent.
  configRef: { name: postgres-data }
  tags:
    reason: "scheduled-nightly"
  # parameters:  # optional per-run overrides on the recipe
  #   compressionOverride: none
  failurePolicy:  # G6: per-run, not hard-coded
    backoffLimit: 2
    activeDeadlineSeconds: 7200
  # Lifecycle of the underlying kopia snapshot when this CR is deleted.
  # Defaults are origin-aware (§4.5):
  #   operator: Delete (or inherits BackupConfig.spec.defaultDeletionPolicy)
  #   manual: Delete (or inherits BackupConfig.spec.defaultDeletionPolicy)
  #   discovered: Retain (FORCED; webhook rejects other values)
  deletionPolicy: Delete  # Delete | Retain | Orphan
status:
  phase: Succeeded  # Pending | Running | Succeeded | Failed | Deleting | Discovered
  origin: operator  # operator | manual | discovered (canonical)
  snapshot:  # the kopia artifact
    kopiaSnapshotID: k1f1ec0a8
    identity:
      username: "postgres-data"
      hostname: "billing"
      sourcePath: /data
  timing:
    startTime: 2026-05-24T02:13:00Z
    endTime: 2026-05-24T02:18:42Z
    durationSeconds: 342
  stats:  # populated from kopia's JSON output
    sizeBytes: 4321098765
    bytesNew: 12345678
    filesNew: 1233
    filesModified: 22
    filesUnchanged: 998111
  job:  # operator/manual only; absent for discovered
    name: backup-postgres-data-20260524-021300
    attempts: 1
  resolved:  # frozen recipe values at run time (operator/manual)
    repository: { kind: Repository, name: nas-primary, namespace: backups }
    sources:
      - pvc: billing/postgres-data
        sourcePath: /data
  conditions:
    - type: SourcesQuiesced
      status: "True"
    - type: SnapshotCreated
      status: "True"
  logTail: |  # capped at ~4KB; full logs in the Job pod
    Snapshot created: k1f1ec0a8
    Total bytes: 4321098765
kubectl get backup shows everything — runs in flight, historical successes, failed attempts, and the discovered catalog — distinguished by the kopia.io/origin label and status.phase.
Spec immutability. The validating webhook freezes spec once status.phase != Pending, with two exceptions:
- spec.deletionPolicy and spec.failurePolicy remain editable post-completion (users may decide after the fact to retain a snapshot, or extend a retry budget).
- Discovered Backups have no spec to mutate; only deletionPolicy: Retain is permitted via the webhook.
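The immutability rule above can be sketched as an update check. This is an illustration only, not the operator's webhook code: BackupSpec is trimmed to a few fields, and the names validate_update, BackupSpec, and the field types are assumptions made for the example.

```rust
/// Phases from Backup.status.phase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Pending,
    Running,
    Succeeded,
    Failed,
    Deleting,
    Discovered,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeletionPolicy {
    Delete,
    Retain,
    Orphan,
}

/// Trimmed-down Backup.spec; the field set and types are illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BackupSpec {
    pub config_ref: Option<String>,
    pub tags: Vec<(String, String)>,
    pub deletion_policy: DeletionPolicy,
    pub backoff_limit: u32,
    pub active_deadline_seconds: u64,
}

/// Decide whether an update from `old` to `new` may be admitted.
pub fn validate_update(phase: Phase, old: &BackupSpec, new: &BackupSpec) -> Result<(), String> {
    // While the Backup is still Pending, the whole spec is editable.
    if phase == Phase::Pending {
        return Ok(());
    }
    // Discovered Backups may only ever carry deletionPolicy: Retain.
    if phase == Phase::Discovered && new.deletion_policy != DeletionPolicy::Retain {
        return Err("discovered Backups only permit deletionPolicy: Retain".to_string());
    }
    // Everything except deletionPolicy and failurePolicy is frozen.
    if old.config_ref != new.config_ref || old.tags != new.tags {
        return Err("spec is immutable once status.phase != Pending".to_string());
    }
    Ok(())
}
```

Changing deletion_policy or the two failure-policy fields on a Succeeded Backup passes this check; changing config_ref or tags does not.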
Why: addresses G17 (invocations as first-class, any trigger source) and G20 (snapshot lifecycle = CR lifecycle, configurable). Folds in the catalog representation cleanly; restores reference one kind of thing. Logs are bounded; full logs live in the Job pod where users expect them.
3.5 BackupSchedule
Creates Backup CRs on a schedule in the BackupConfig's namespace.
apiVersion: kopia.io/v1alpha1
kind: BackupSchedule
metadata:
  name: postgres-data-nightly
  namespace: billing
spec:
  configRef:
    name: postgres-data
  schedule:
    cron: "H 2 * * *"  # G4: Jenkins-style 'H' substitution
    jitter: 30m  # deterministic; see §4.1
    timezone: "America/Los_Angeles"
    runOnCreate: false  # G3: GitOps-friendly default
  suspend: false
  concurrencyPolicy: Forbid  # Forbid | Allow | Replace (G18)
  startingDeadlineSeconds: 600
  failedJobsHistoryLimit: 3  # successful Backup retention is governed by
                             # BackupConfig.spec.retention (§4.4)
status:
  lastSchedule:
    scheduledAt: 2026-05-24T02:13:00Z  # cron + jitter, pinned (predictable for alerting)
    backupRef: { name: postgres-data-20260524-021300 }
  nextSchedule:
    at: 2026-05-25T02:21:00Z
  lastSuccessfulSchedule:
    at: 2026-05-24T02:13:00Z
    backupRef: { name: postgres-data-20260524-021300 }
  consecutiveFailures: 0
  conditions:
    - type: ConfigResolvable
      status: "True"
Note the deliberate absence of successfulJobsHistoryLimit: successful retention is GFS-driven on BackupConfig.spec.retention, not flat-count on the schedule. See §4.4 for the rationale.
Why: mirrors CronJob semantics for the parts that matter (G4, G18). Schedule anchoring is wall-clock (cron(now)), not cron(lastSyncTime) — fixes volsync's drift behavior. Pinned scheduledAt lets ops alerts say "you missed the 02:13 slot" without ambiguity.
3.6 Restore
A restore from a Backup (or raw kopia identity) to a PVC.
apiVersion: kopia.io/v1alpha1
kind: Restore
metadata:
  name: postgres-restore-2026-05-23
  namespace: billing
spec:
  # Optional. Derived from `source` when omitted (the Backup CR / BackupConfig CR
  # knows its Repository). Required only with `source.identity`.
  # repository: { kind: Repository, name: nas-primary, namespace: backups }
  # Exactly one of the following. Webhook-enforced.
  source:
    # Preferred: a Backup CR (operator-initiated, manual, or discovered; all the same kind).
    backupRef: { name: postgres-data-20260524-021300, namespace: billing }
    # Or a BackupConfig CR: resolves via identity against the repo, even if no Backup
    # CR has ever been created in this cluster (deploy-or-restore on a fresh cluster
    # against an existing repo).
    # fromConfig:
    #   name: postgres-data
    #   asOf: 2026-05-23T20:00:00Z
    #   offset: 0  # 0 = latest, 1 = previous, ...
    # Or a raw kopia identity (works for foreign writers or snapshots that have
    # aged out of the K8s-side catalog window).
    # identity:
    #   username: postgres-data
    #   hostname: billing
    #   sourcePath: /data
    #   snapshotID: k1f1ec0a8  # or asOf / offset
    # spec.repository is REQUIRED when using `identity`.
  # Optional. Three modes.
  target:
    # Mode 1: operator creates the PVC.
    pvc:
      name: postgres-data-restored
      storageClassName: fast-ssd
      capacity: 100Gi
      accessModes: [ReadWriteOnce]
    # Mode 2: write into an existing PVC.
    # pvcRef: { name: postgres-data-restored }
    # Mode 3: omit `target` entirely (passive). A PVC with
    # spec.dataSourceRef -> this Restore kicks off the populator handshake.
  options:
    enableFileDeletion: false
    ignorePermissionErrors: true
    writeFilesAtomically: true
  policy:
    onMissingSnapshot: Fail  # default for explicit sources (backupRef/identity)
    # For source.fromConfig the default is Continue (see §4.6 for the deploy-or-restore pattern).
    waitTimeout: 5m
status:
  phase: Restoring  # Pending | Resolving | Restoring | Completed | Failed
  resolved:  # pinned at admission
    backupRef: { name: postgres-data-20260524-021300, namespace: billing }
    repository: { kind: Repository, name: nas-primary, namespace: backups }
    pinnedAt: 2026-05-24T17:33:11Z
    identity:
      username: postgres-data
      hostname: billing
      sourcePath: /data
  target:  # what the operator is writing into
    pvcPrime: pvc-prime-9f8e2c1b  # populator handshake (passive / pvc-create modes)
    pvcRef: { name: postgres-data-restored }
  timing:
    startTime: 2026-05-24T17:33:14Z
  progress:
    bytesRestored: 8123456789
    filesRestored: 998111
Why: addresses G7 (fail-closed defaults), G8/G9 (admission-time resolution, no drift on re-apply), G10 (single restore path covers populator and in-place uniformly). Three source modes cover the spectrum: K8s-native reference (backupRef), recipe-driven (fromConfig — the GitOps deploy-or-restore pattern), and raw kopia identity (foreign writers, aged-out catalog). spec.repository is derivable for the first two; required only when raw identity is the source.
3.7 Maintenance
apiVersion: kopia.io/v1alpha1
kind: Maintenance
metadata: { name: nas-primary, namespace: backups }
spec:
  repository:
    kind: Repository  # Repository | ClusterRepository
    name: nas-primary
  schedule:
    quick: { cron: "0 */6 * * *", jitter: 30m }
    full: { cron: "0 3 * * 0", jitter: 1h }
    timezone: UTC
  ownership:
    owner: "kopia-operator/nas-primary"
    takeoverPolicy: PromptCondition  # Never | PromptCondition | Force
  mover:
    resources: { requests: { cpu: 250m, memory: 1Gi }, limits: { cpu: "2", memory: 4Gi } }
  failurePolicy:
    backoffLimit: 1
    activeDeadlineSeconds: 14400
status:
  ownership:
    owner: "kopia-operator/nas-primary"
    claimedAt: 2026-05-12T08:14:02Z
  quick:
    lastRunAt: 2026-05-24T12:00:11Z
    nextScheduledAt: 2026-05-24T18:00:00Z
    consecutiveFailures: 0
    lastContentReclaimedBytes: 1234567
  full:
    lastRunAt: 2026-05-19T03:01:42Z
    nextScheduledAt: 2026-05-26T03:00:00Z
    consecutiveFailures: 0
    lastContentReclaimedBytes: 89456789012
  conditions:
    - type: OwnershipClaimed
      status: "True"
Why: at most one Maintenance per Repository/ClusterRepository (webhook-enforced) — kills the perfectra1n cross-namespace first-writer-wins race by making the conflict unrepresentable. lastContentReclaimedBytes is the only place storage reclamation is surfaced; per-Backup deletion only marks manifests for GC (§4.5).
4. Key behaviors
4.1 Scheduling
- CronJob-style wall-clock anchoring (cron(now)), not last-completion anchoring. Fixes volsync's drift.
- jitter is deterministic, derived from BackupSchedule.UID + base scheduledAt. HA operator replicas compute identical fire times without coordination; controller restarts re-derive the same value without persisting it. No "re-roll on restart" hazard.
- cron: "H * * * *" uses literal H substitution; the result is pinned in status.lastSchedule.scheduledAt.
- runOnCreate: false is the default (G3).
- concurrencyPolicy: Forbid is the default; skipped runs surface a condition rather than silently piling up (G5/G18).
- The validating webhook parses the cron expression with the same parser the controller uses at runtime, so bad expressions are rejected at apply time, not at first reconcile.
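One way to get a jitter that every replica derives identically is to hash the schedule UID together with the base slot time. The sketch below assumes a hand-rolled 64-bit FNV-1a hash and Unix-second timestamps; the design does not say which hash or time representation the operator uses, and jitter_offset_secs and fire_time are hypothetical names.

```rust
/// 64-bit FNV-1a. Chosen for this sketch only because it is tiny and gives
/// the same result on every build; the operator's real hash is unspecified.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Jitter offset in seconds, in the half-open range [0, jitter_secs).
/// Depends only on the schedule UID and the base slot, so it never needs
/// to be persisted or coordinated between replicas.
pub fn jitter_offset_secs(schedule_uid: &str, base_scheduled_at: i64, jitter_secs: u64) -> u64 {
    if jitter_secs == 0 {
        return 0;
    }
    let mut input = Vec::with_capacity(schedule_uid.len() + 8);
    input.extend_from_slice(schedule_uid.as_bytes());
    input.extend_from_slice(&base_scheduled_at.to_be_bytes());
    fnv1a64(&input) % jitter_secs
}

/// Pinned fire time for one slot: base cron time plus deterministic jitter.
pub fn fire_time(schedule_uid: &str, base_scheduled_at: i64, jitter_secs: u64) -> i64 {
    base_scheduled_at + jitter_offset_secs(schedule_uid, base_scheduled_at, jitter_secs) as i64
}
```

Because the base slot is part of the hash input, consecutive days get different offsets, while a restart in the middle of a day recomputes the same one.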
4.2 Identity model
For a BackupConfig C in namespace N backing up PVC P:
- username defaults to C.metadata.name; hostname defaults to N.
- For ClusterRepository consumers, the repository's identityDefaults templates apply unless BackupConfig.spec.identity overrides them.
- sourcePath defaults to /pvc/<P>; sourcePathStrategy: PVCNamespacedName is available for multi-namespace selectors.
- Resolved identity always appears in BackupConfig.status.resolved.identity and Backup.status.snapshot.identity.
This is the part where a kopia-native operator can do better than volsync's accidental design — identity is the API, not an internal detail.
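The defaulting rules can be illustrated with a small resolver. It is a sketch under assumptions: templates are expanded by plain string replacement of the two placeholders shown in §3.2, sourcePathOverride and sourcePathStrategy are left out, and every type and function name here is hypothetical.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Identity {
    pub username: String,
    pub hostname: String,
    pub source_path: String,
}

impl Identity {
    /// Renders the username@hostname:path form used for kopia sources.
    pub fn to_kopia_source(&self) -> String {
        format!("{}@{}:{}", self.username, self.hostname, self.source_path)
    }
}

/// BackupConfig.spec.identity: both fields optional.
#[derive(Debug, Clone, Default)]
pub struct IdentityOverride {
    pub username: Option<String>,
    pub hostname: Option<String>,
}

/// ClusterRepository.spec.identityDefaults.
#[derive(Debug, Clone)]
pub struct IdentityDefaults {
    pub hostname_template: String,
    pub username_template: String,
}

// Naive placeholder expansion; a real template engine is out of scope here.
fn render(template: &str, namespace: &str, config_name: &str) -> String {
    template
        .replace("{{ .Namespace }}", namespace)
        .replace("{{ .ConfigName }}", config_name)
}

pub fn resolve_identity(
    config_name: &str,
    namespace: &str,
    pvc_name: &str,
    overrides: &IdentityOverride,
    cluster_defaults: Option<&IdentityDefaults>,
) -> Identity {
    // Defaults come from the ClusterRepository templates when present,
    // otherwise from the BackupConfig's own name and namespace.
    let default_username = match cluster_defaults {
        Some(d) => render(&d.username_template, namespace, config_name),
        None => config_name.to_string(),
    };
    let default_hostname = match cluster_defaults {
        Some(d) => render(&d.hostname_template, namespace, config_name),
        None => namespace.to_string(),
    };
    Identity {
        // An explicit spec.identity value always wins over either default.
        username: overrides.username.clone().unwrap_or(default_username),
        hostname: overrides.hostname.clone().unwrap_or(default_hostname),
        source_path: format!("/pvc/{}", pvc_name),
    }
}
```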
4.3 Repository sharing
Many BackupConfigs point at one Repository or ClusterRepository. Each writes under its own identity, so snapshots never collide. The repo is created lazily on first connect failure; the race is mediated by kopia's own object-store guarantees plus a per-repo lease in the operator. The RESTIC_HOST="volsync" anti-pattern doesn't apply.
For ClusterRepository, the allowedNamespaces gate is enforced at admission on BackupConfig, Backup (manual), Restore, and Maintenance. A namespace that loses its allow-list entry retains its existing Backup CRs (no retroactive deletion) but cannot create new ones.
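Since the CRD surface is modeled as enums, the allowedNamespaces gate maps naturally onto a three-variant type. The following is a sketch only: the selector is reduced to matchLabels, and AllowedNamespaces and namespace_allowed are illustrative names rather than the operator's actual types.

```rust
use std::collections::BTreeMap;

/// ClusterRepository.spec.allowedNamespaces: exactly one of list, selector, all.
#[derive(Debug, Clone)]
pub enum AllowedNamespaces {
    List(Vec<String>),
    /// Simplified to matchLabels only for this sketch.
    Selector(BTreeMap<String, String>),
    All,
}

/// Admission-time check for a consumer CR living in `namespace`.
pub fn namespace_allowed(
    gate: &AllowedNamespaces,
    namespace: &str,
    namespace_labels: &BTreeMap<String, String>,
) -> bool {
    match gate {
        AllowedNamespaces::All => true,
        AllowedNamespaces::List(names) => names.iter().any(|n| n == namespace),
        // Every selector label must be present on the namespace with the same value.
        AllowedNamespaces::Selector(match_labels) => match_labels
            .iter()
            .all(|(k, v)| namespace_labels.get(k) == Some(v)),
    }
}
```

The check runs only when a consumer CR is admitted, which is consistent with existing Backup CRs surviving the removal of their namespace from the list.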
4.4 Retention enforcement
BackupConfig.spec.retention is the only retention mechanism for successful operator-initiated and manual Backups. It is enforced operator-side by pruning Backup CRs; each pruned CR's deletionPolicy then drives what happens to the underlying snapshot.
Algorithm. On every Backup completion under a BackupConfig, and on a periodic timer per BackupConfig:
- List all Backup CRs in the operator's cache where status.resolved.repository and status.snapshot.identity match this BackupConfig's resolved values, and origin ∈ {operator, manual}.
- Sort by status.timing.endTime descending.
- Apply the GFS retention buckets in order: keepLatest, keepHourly, keepDaily, keepWeekly, keepMonthly, keepAnnual. A Backup qualifies for a bucket if its endTime is the most recent within that bucket's window.
- Any Backup not selected by any bucket is deleted.
- Deletion runs through the standard kopia.io/snapshot-cleanup finalizer (§4.5).
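The selection step of this algorithm can be sketched over plain timestamps. Assumptions in this sketch: endTime is given as Unix seconds, bucket boundaries are computed in UTC with weeks starting on Monday, and keepMonthly/keepAnnual are omitted because they need calendar arithmetic. Retention and select_for_pruning are hypothetical names.

```rust
use std::collections::HashSet;

/// Subset of BackupConfig.spec.retention covered by this sketch.
#[derive(Debug, Clone, Copy, Default)]
pub struct Retention {
    pub keep_latest: usize,
    pub keep_hourly: usize,
    pub keep_daily: usize,
    pub keep_weekly: usize,
}

// Bucket keys over Unix seconds, in UTC (an assumption of this sketch).
fn hour_bucket(t: i64) -> i64 {
    t.div_euclid(3_600)
}

fn day_bucket(t: i64) -> i64 {
    t.div_euclid(86_400)
}

fn week_bucket(t: i64) -> i64 {
    // 1970-01-01 was a Thursday; shifting by 3 days aligns weeks to Monday.
    (t.div_euclid(86_400) + 3).div_euclid(7)
}

/// Returns the indices into `end_times` of the Backups to delete, ascending.
pub fn select_for_pruning(end_times: &[i64], retention: &Retention) -> Vec<usize> {
    // Sort newest first.
    let mut order: Vec<usize> = (0..end_times.len()).collect();
    order.sort_by(|a, b| end_times[*b].cmp(&end_times[*a]));

    let mut keep: HashSet<usize> = HashSet::new();

    // keepLatest: the N most recent Backups, regardless of spacing.
    for &idx in order.iter().take(retention.keep_latest) {
        keep.insert(idx);
    }

    // Time buckets: walking newest first, the first Backup seen in each
    // bucket is the most recent one inside that bucket's window.
    let buckets: [(usize, fn(i64) -> i64); 3] = [
        (retention.keep_hourly, hour_bucket as fn(i64) -> i64),
        (retention.keep_daily, day_bucket as fn(i64) -> i64),
        (retention.keep_weekly, week_bucket as fn(i64) -> i64),
    ];
    for &(limit, key) in buckets.iter() {
        let mut kept = 0usize;
        let mut last_bucket: Option<i64> = None;
        for &idx in &order {
            if kept == limit {
                break;
            }
            let bucket = key(end_times[idx]);
            if last_bucket != Some(bucket) {
                keep.insert(idx);
                last_bucket = Some(bucket);
                kept += 1;
            }
        }
    }

    // Anything no bucket selected is pruned.
    (0..end_times.len()).filter(|i| !keep.contains(i)).collect()
}
```

In this sketch a retention with every count at zero prunes everything; whether an unset retention block should instead mean "keep all" is a policy decision the sketch does not settle.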
Failed Backups are governed by BackupSchedule.spec.failedJobsHistoryLimit (operator-origin failures) or are user-managed (manual-origin failures). They are not subject to GFS.
Discovered Backups are governed by Repository.spec.catalog.retain / ClusterRepository.spec.catalog.retain. When a discovered CR ages out of the catalog window, the operator deletes the CR; the forced deletionPolicy: Retain ensures the underlying snapshot remains in the repository (§4.5).
Exclusivity with Kopia-side retention policies. The operator does not invoke kopia policy set --keep-*. Repository-level retention policies set by users running the kopia CLI directly against an operator-managed repository will conflict with CR-driven retention and may cause double-deletion. The validating webhook on Repository rejects inline policy fields that would set retention at the repo level. Setting retention out-of-band with the kopia CLI is documented as unsupported.
Why not also a flat-count cap on BackupSchedule? Two retention drivers for the same set of objects create rule-precedence questions and surprises (a flat cap can silently undercut a GFS policy that should have retained an annual snapshot). GFS alone, enforced consistently, is the simpler model. Users who want a hard cap can set keepLatest low.
4.5 Backup deletion semantics
A Backup CR owns the lifecycle of its kopia snapshot by default. The finalizer kopia.io/snapshot-cleanup is added to every Backup at admission and is the load-bearing mechanism.
deletionPolicy defaults by origin:
| Origin | Default | Other values allowed? |
|---|---|---|
| operator | Delete (inherits BackupConfig.spec.defaultDeletionPolicy if set) | Yes |
| manual | Delete (inherits BackupConfig.spec.defaultDeletionPolicy if set) | Yes |
| discovered | Retain | No; webhook rejects Delete/Orphan |
The discovered restriction prevents data loss: the operator did not create those snapshots and may not be the only writer. Aging out a discovered Backup CR via catalog.retain is a Kubernetes-side cleanup, not a repository-side one.
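The defaulting table translates directly into a match over the origin. This is a minimal sketch with illustrative names (effective_deletion_policy is not taken from the operator's code); it folds the webhook rejection for discovered Backups into the same function.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Origin {
    Operator,
    Manual,
    Discovered,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeletionPolicy {
    Delete,
    Retain,
    Orphan,
}

/// Resolve the policy for a Backup.
/// `requested` is Backup.spec.deletionPolicy, if set;
/// `config_default` is BackupConfig.spec.defaultDeletionPolicy, if set.
pub fn effective_deletion_policy(
    origin: Origin,
    requested: Option<DeletionPolicy>,
    config_default: Option<DeletionPolicy>,
) -> Result<DeletionPolicy, String> {
    match origin {
        // Discovered Backups are forced to Retain; anything else is rejected.
        Origin::Discovered => match requested {
            None | Some(DeletionPolicy::Retain) => Ok(DeletionPolicy::Retain),
            Some(other) => Err(format!(
                "discovered Backups must use Retain, got {:?}",
                other
            )),
        },
        // Operator and manual: explicit value, then the config default, then Delete.
        Origin::Operator | Origin::Manual => Ok(requested
            .or(config_default)
            .unwrap_or(DeletionPolicy::Delete)),
    }
}
```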
Behaviour on CR deletion:
| deletionPolicy | Action | Final state |
|---|---|---|
| Delete | Operator spawns a one-shot mover pod that runs kopia snapshot delete --delete <kopiaSnapshotID>. On success, the finalizer is removed and the CR disappears. | Manifest deleted; content reclaimed at next maintenance run. |
| Retain | Finalizer is removed immediately; CR disappears. | Snapshot remains in the repository, still discoverable via the catalog (and may rematerialize as origin: discovered). |
| Orphan | Operator removes tracking labels (kopia.io/backup-config, kopia.io/identity-hash, …) so the snapshot is no longer surfaced under this config. Finalizer is then removed. | Snapshot remains; will be visible only via raw identity or as a discovered backup if it falls inside the catalog window. |
Failure during Delete. If kopia snapshot delete fails, the CR stays in phase: Deleting with a SnapshotDeletionFailed condition and an exponential-backoff retry. The CR is not silently dropped — operators want to see "your snapshot wasn't actually deleted."
Force-delete escape hatch. When the repository is unreachable and the user needs the CR gone:
kubectl annotate backup postgres-data-20260524-021300 \
kopia.io/skip-snapshot-cleanup=true --overwrite
kubectl delete backup postgres-data-20260524-021300
The annotation causes the finalizer to remove itself without running the delete pod. The controller emits a warn-level log line and a SnapshotOrphaned Event recording the kopia snapshot ID for audit. The same annotation works on stuck Maintenance runs.
Manifest deletion vs. content reclamation. Kopia marks manifests deleted immediately, but on-disk content is reclaimed only during maintenance. The model honors this:
- Backup.status.phase transitions Succeeded → Deleting → (CR removed).
- The byte-level storage drop appears on Maintenance.status.{quick,full}.lastContentReclaimedBytes, never on any Backup field.
This asymmetry is called out in user-facing documentation because it is the kind of thing that causes "I deleted the backup, why is my bucket the same size?" support questions.
4.6 Restore resolution & semantics
Restore.spec.source resolution at admission:
| Source mode | Resolution | spec.repository required? | Default onMissingSnapshot |
|---|---|---|---|
| backupRef | Look up the Backup CR; derive repository from it | No | Fail |
| fromConfig | Resolve identity from BackupConfig, query repo directly | No (derived from the BackupConfig) | Continue |
| identity | Direct kopia query | Yes | Fail |
fromConfig + Continue is the deploy-or-restore pattern: apply Repository + BackupConfig + BackupSchedule + Restore + workload PVC together. Fresh cluster against an existing repo → PVC restored from latest. Fresh cluster against empty repo → PVC binds empty, BackupSchedule starts producing Backups under the same identity, and a future redeploy restores from there. No manifest changes between the two cases.
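The per-mode rules in the table can be sketched as an enum with two small functions. The type and function names are illustrative, and the variant fields are cut down to what the example needs.

```rust
/// Restore.spec.source: exactly one mode.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RestoreSource {
    BackupRef { name: String, namespace: String },
    FromConfig { name: String, offset: u32 },
    Identity { username: String, hostname: String, source_path: String },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OnMissingSnapshot {
    Fail,
    Continue,
}

/// Default applied when Restore.spec.policy.onMissingSnapshot is not set.
pub fn default_on_missing(source: &RestoreSource) -> OnMissingSnapshot {
    match source {
        // Deploy-or-restore: an empty repo is an expected case, not an error.
        RestoreSource::FromConfig { .. } => OnMissingSnapshot::Continue,
        // Explicit sources fail closed.
        RestoreSource::BackupRef { .. } | RestoreSource::Identity { .. } => OnMissingSnapshot::Fail,
    }
}

/// spec.repository can be derived for backupRef and fromConfig,
/// but must be given explicitly for a raw identity.
pub fn check_repository(source: &RestoreSource, spec_repository_set: bool) -> Result<(), String> {
    match source {
        RestoreSource::Identity { .. } if !spec_repository_set => {
            Err("spec.repository is required when source.identity is used".to_string())
        }
        _ => Ok(()),
    }
}
```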
writeFilesAtomically: true is the default. ignorePermissionErrors: true is the default (and surfaces a condition if any errors occurred — non-silent).
4.7 Volume populator
To clarify the field's status on modern Kubernetes:
- PVC.spec.dataSourceRef has been usable by default since 1.24, when the AnyVolumeDataSource feature gate went beta (default-on).
- A populator controller (this operator) watches PVCs whose dataSourceRef references its kind and runs the pvc-prime + claimRef-rebind handshake.
- kubernetes-csi/volume-data-source-validator (which ships the populator.storage.k8s.io/VolumePopulator CRD) is optional. Without it, PVCs that mistype their populator ref hang Pending. With it, they're rejected at admission. The actual data-moving machinery works either way.
Our position: the populator path works on any cluster ≥ 1.24 without installing anything extra. If the VolumePopulator CRD is present at operator startup, we register ourselves for the better UX; if absent, we log it and carry on. No hard dependency.
This addresses G10 by making the populator path uniform (passive Restore) and never gating it on copy-method.
4.8 Hooks
hooks.beforeSnapshot[] / hooks.afterSnapshot[] accept one of:
- workloadExec: kubectl exec-style into a matched workload pod/container (the default and most-requested form, fork #22).
- runJob: { jobSpec: ... } takes a full JobSpec to run as a one-shot Job (k8up-style PreBackupPod analog). Named runJob to make the materialization explicit.
- httpRequest: typed POST to a URL for cross-system orchestration.
Hook failures abort the backup by default; continueOnFailure: true is opt-in per hook.
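A sketch of the abort-by-default rule, assuming each hook has already run and reported success or failure. HookOutcome and evaluate_hooks are hypothetical names, and how tolerated failures are reported back to the user is not specified in this section.

```rust
/// Result of running one hook; `continue_on_failure` mirrors the per-hook field.
#[derive(Debug, Clone)]
pub struct HookOutcome {
    pub name: String,
    pub succeeded: bool,
    pub continue_on_failure: bool,
}

/// Ok(tolerated) lists failed hooks that opted into continueOnFailure;
/// Err(name) names the first hook whose failure aborts the backup.
pub fn evaluate_hooks(outcomes: &[HookOutcome]) -> Result<Vec<String>, String> {
    let mut tolerated = Vec::new();
    for outcome in outcomes {
        if outcome.succeeded {
            continue;
        }
        if outcome.continue_on_failure {
            tolerated.push(outcome.name.clone());
        } else {
            return Err(outcome.name.clone());
        }
    }
    Ok(tolerated)
}
```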
4.9 Multi-PVC consistency
groupBy: VolumeGroupSnapshot is the default for multi-PVC sources. If the chosen volumeSnapshotClass's driver doesn't support VGS, the BackupConfig reports a GroupSnapshotUnsupported condition and refuses to run. Silently falling back to per-PVC snapshots would mean inconsistent backups — the same data-integrity hazard as #1211. To intentionally accept per-PVC snapshots, set groupBy: None explicitly.
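The refusal rule reduces to a small decision function. This sketch assumes driver support for VolumeGroupSnapshot has already been determined elsewhere and is passed in as a boolean; GroupBy, SnapshotPlan, and plan_snapshots are illustrative names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroupBy {
    VolumeGroupSnapshot,
    /// Corresponds to `groupBy: None` in the spec.
    NoGrouping,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotPlan {
    GroupSnapshot,
    PerPvcSnapshots,
    /// The BackupConfig refuses to run and reports this condition.
    Refuse { condition: &'static str },
}

pub fn plan_snapshots(group_by: GroupBy, driver_supports_vgs: bool) -> SnapshotPlan {
    match (group_by, driver_supports_vgs) {
        (GroupBy::VolumeGroupSnapshot, true) => SnapshotPlan::GroupSnapshot,
        // No silent fallback: an unsupported driver blocks the backup.
        (GroupBy::VolumeGroupSnapshot, false) => SnapshotPlan::Refuse {
            condition: "GroupSnapshotUnsupported",
        },
        // Per-PVC snapshots happen only when the user asked for them.
        (GroupBy::NoGrouping, _) => SnapshotPlan::PerPvcSnapshots,
    }
}
```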
4.10 Observability
| Surface | Volsync | This operator |
|---|---|---|
| Per-PVC metrics labels | absent (#518) | always present (pvc, pvc_namespace, backup_config, repository) |
| Stale-metrics-on-delete | yes (#1194) | metrics are CR-scoped, deleted with the CR |
| lastSuccessAt | only derivable | BackupSchedule.status.lastSuccessfulSchedule first-class |
| Snapshot count | exec into pod | kubectl get backup -l kopia.io/backup-config=... |
| Logs | tail of last pod | small tail (~4KB) in Backup.status.logTail; full logs in Job pod |
| Repo storage stats | absent | Repository.status.storageStats |
| Content reclamation | absent | Maintenance.status.*.lastContentReclaimedBytes |
SLO-friendly metrics:
- kopia_operator_backup_last_success_timestamp_seconds{backup_config,namespace}: gauge.
- kopia_operator_backup_consecutive_failures{backup_config,namespace}: gauge.
- kopia_operator_restore_duration_seconds{...}: summary (p50/p90/p99).
- kopia_operator_snapshot_deletion_failures_total{repository}: counter; alert on rate.
- kopia_operator_orphaned_snapshots_total{repository}: counter; incremented by the skip-snapshot-cleanup escape hatch.
4.11 Security & RBAC
- Operator is namespaced by default; cluster-scoped install is opt-in via Helm value. The ClusterRepository CRD is registered regardless (it's the shape; whether tenants can read it is RBAC).
- Mover pods run as runAsNonRoot: true, runAsUser: 65534 (nobody) by default. Files restored may not match original ownership; this is a documented trade-off, not a hidden surprise.
- mover.securityContext: {...} is the explicit override.
- mover.inheritSecurityContextFrom: { podSelector } is an opt-in best-effort copy from a live consumer; it fails loud (condition) if no pod matches at backup time. Not a default because the workload may be scaled to zero.
- mover.privilegedMode: true (namespace-gated by kopia.io/allow-privileged-movers: "true") runs with CHOWN/FOWNER for clean ownership restoration. Explicit double opt-in.
- lost+found and similar system entries are skipped by default (fork #1033/#1889).
4.12 Mover pods & failure handling
- Jobs use restartPolicy: Never and backoffLimit: spec.failurePolicy.backoffLimit (default 2).
- concurrencyPolicy: Forbid is the default; missed slots produce a BackupSkipped condition.
- activeDeadlineSeconds defaults to 7200.
- Completed mover pods are reaped on the same reconcile that observes their terminal status, so no zombie pods are left behind (fork #8).
4.13 Forward compatibility
Every credential, schedule, policy, and identity surface is a sub-object rather than a leaf field, so future fields slot in without changing the basic shape. Specifically deferred but accommodated:
- Repository.spec.encryption.rotation / ClusterRepository.spec.encryption.rotation: password rotation flow.
- Repository.spec.access.readOnly: read-only repo mode for restore-only operators.
- Repository.spec.backend.<type>.auth.workloadIdentity: IRSA/WIF (already structurally present; deprioritized for the homelab default).
- Backup.spec.parameters: typed run-time overrides beyond just tags.
The kopia.io API group itself is v1alpha1; webhook conversions to v1beta1/v1 will be additive only.
The API-server-dependency-on-the-operator's-webhook concern is bounded: webhooks intercept only kopia.io/* CRDs. PVC admission, populator dispatch, and in-flight restore reconciliation never depend on the webhook being up. Standard failurePolicy: Fail is appropriate.
5. Usage walkthroughs
5.1 Single PVC, scheduled daily
apiVersion: v1
kind: Secret
metadata: { name: nas-primary-creds, namespace: backups }
stringData:
  AWS_ACCESS_KEY_ID: ...
  AWS_SECRET_ACCESS_KEY: ...
  KOPIA_PASSWORD: choose-something-long
---
apiVersion: kopia.io/v1alpha1
kind: Repository
metadata: { name: nas-primary, namespace: backups }
spec:
  backend:
    s3:
      bucket: my-backups
      prefix: prod/
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
      auth: { secretRef: { name: nas-primary-creds } }
  encryption: { passwordSecretRef: { name: nas-primary-creds, key: KOPIA_PASSWORD } }
  create: { enabled: true }
---
apiVersion: kopia.io/v1alpha1
kind: BackupConfig
metadata: { name: postgres-data, namespace: billing }
spec:
  repository: { kind: Repository, name: nas-primary, namespace: backups }
  sources: [{ pvc: { name: postgres-data } }]
  retention: { keepDaily: 14, keepWeekly: 4 }
---
apiVersion: kopia.io/v1alpha1
kind: BackupSchedule
metadata: { name: postgres-data-nightly, namespace: billing }
spec:
  configRef: { name: postgres-data }
  schedule:
    cron: "H 2 * * *"
    jitter: 30m
    runOnCreate: false
Maintenance is implicit — a default Maintenance is created on first reference to a Repository unless one already exists.
5.2 Shared platform repository
apiVersion: kopia.io/v1alpha1
kind: ClusterRepository
metadata: { name: shared-primary }
spec:
  backend:
    s3:
      bucket: org-kopia-repo
      prefix: ""  # bucket root, maximum dedup
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
      auth:
        secretRef: { name: kopia-platform-creds, namespace: kopia-system }
  encryption:
    passwordSecretRef: { name: kopia-platform-creds, namespace: kopia-system, key: KOPIA_PASSWORD }
  allowedNamespaces:
    list: [billing, payments, identity]
  identityDefaults:
    hostnameTemplate: "{{ .Namespace }}"
    usernameTemplate: "{{ .Namespace }}-{{ .ConfigName }}"
---
# In the tenant namespace: no need to know the secret name or platform details
apiVersion: kopia.io/v1alpha1
kind: BackupConfig
metadata: { name: postgres-data, namespace: billing }
spec:
  repository: { kind: ClusterRepository, name: shared-primary }
  sources: [{ pvc: { name: postgres-data } }]
  retention: { keepDaily: 14 }
Identity resolves to billing-postgres-data@billing:/pvc/postgres-data via the templates.
5.3 Restore by picking a backup
kubectl get backup -n billing \
-l kopia.io/backup-config=postgres-data \
--sort-by=.status.timing.startTime
apiVersion: kopia.io/v1alpha1
kind: Restore
metadata: { name: postgres-restore-yesterday, namespace: billing }
spec:
  source:
    backupRef: { name: postgres-data-20260523-021300, namespace: billing }
  target:
    pvc:
      name: postgres-data-restored
      storageClassName: fast-ssd
      capacity: 100Gi
      accessModes: [ReadWriteOnce]
5.4 Multi-PVC selector
apiVersion: kopia.io/v1alpha1
kind: BackupConfig
metadata: { name: app-bundle, namespace: billing }
spec:
  repository: { kind: Repository, name: nas-primary, namespace: backups }
  identity: { username: app-bundle, hostname: billing }
  sources:
    - pvcSelector:
        labelSelector: { matchLabels: { backup: include } }
      sourcePathStrategy: PVCName
  groupBy: VolumeGroupSnapshot
  retention: { keepDaily: 14 }
5.5 Deploy-or-restore (GitOps)
The headline pattern. Apply everything together; on a fresh cluster against an existing repo, the PVC restores; on a fresh repo, it comes up empty and gets backed up going forward.
apiVersion: kopia.io/v1alpha1
kind: BackupConfig
metadata: { name: postgres-data, namespace: billing }
spec:
  repository: { kind: Repository, name: nas-primary, namespace: backups }
  sources: [{ pvc: { name: postgres-data } }]
---
apiVersion: kopia.io/v1alpha1
kind: BackupSchedule
metadata: { name: postgres-data-nightly, namespace: billing }
spec:
  configRef: { name: postgres-data }
  schedule: { cron: "H 2 * * *", jitter: 30m, runOnCreate: false }
---
apiVersion: kopia.io/v1alpha1
kind: Restore
metadata: { name: postgres-data-restore, namespace: billing }
spec:
  source: { fromConfig: { name: postgres-data, offset: 0 } }
  policy: { onMissingSnapshot: Continue } # default for fromConfig; explicit here for clarity
  # No target: passive. The PVC below references this Restore.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: postgres-data, namespace: billing }
spec:
  storageClassName: fast-ssd
  resources: { requests: { storage: 100Gi } }
  accessModes: [ReadWriteOnce]
  dataSourceRef:
    apiGroup: kopia.io
    kind: Restore
    name: postgres-data-restore
5.6 Manual one-shot backup
apiVersion: kopia.io/v1alpha1
kind: Backup
metadata: { name: postgres-data-pre-migration, namespace: billing }
spec:
  configRef: { name: postgres-data }
  tags: { reason: "pre-schema-migration" }
  deletionPolicy: Retain # I want this one to survive my next prune
The same Backup can be created from any external system: an Argo Events Sensor, a Tekton Task, GitHub Actions, or a webhook handler. The Backup CR is the universal entry point.
5.7 Restore from a discovered (foreign / pre-install) backup
# Discovered Backups live in the Repository's namespace because the operator
# has no reliable way to attribute them to a BackupConfig.
kubectl get backup -n backups -l kopia.io/origin=discovered
apiVersion: kopia.io/v1alpha1
kind: Restore
metadata: { name: rescue-restore, namespace: billing }
spec:
  source:
    backupRef: { name: kopia-disc-9c2a1f, namespace: backups }
  target: { pvc: { name: rescue-pvc, storageClassName: fast-ssd, capacity: 50Gi, accessModes: [ReadWriteOnce] } }
5.8 Forcing CR removal when the repo is offline
kubectl annotate backup postgres-data-pre-migration -n billing \
kopia.io/skip-snapshot-cleanup=true --overwrite
kubectl delete backup postgres-data-pre-migration -n billing
# Snapshot remains in the repo and will rematerialize as `origin: discovered`
# once Repository is healthy and within the catalog window.
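The finalizer decision behind this escape hatch can be sketched as follows. This is an illustration only: it assumes the annotation value is the string `"true"`, that discovered Backups are always treated as `Retain`, and that only `deletionPolicy: Delete` removes the snapshot. `should_delete_snapshot` is a hypothetical name.

```python
SKIP_ANNOTATION = "kopia.io/skip-snapshot-cleanup"


def should_delete_snapshot(deletion_policy: str, annotations: dict, origin: str) -> bool:
    """Decide whether the finalizer removes the kopia snapshot before releasing the CR."""
    if annotations.get(SKIP_ANNOTATION) == "true":
        return False  # escape hatch: release the CR without touching the repo
    if origin == "discovered":
        return False  # discovered Backups are forced to Retain
    return deletion_policy == "Delete"


# With the annotation set, the CR can go away even though the repo is offline.
print(should_delete_snapshot("Delete", {SKIP_ANNOTATION: "true"}, "manual"))
```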
5.9 Suspending a schedule via GitOps
spec:
  schedule:
    suspend: true # apply via PR; un-suspend in a follow-up PR
In-flight Backups are unaffected; only future cron firings are skipped.
6. Consequences
6.1 Positive
- Kopia-native ergonomics: identity, policy, hooks, snapshot listing all map to user mental models 1:1 with kopia.
- Three-layer triggering: any source can fire a `Backup`. Solves #1559 and #627 together.
- GitOps deploy-or-restore is a single-manifest pattern, not a runbook.
- `kubectl get backup` is the one place to look: runs, history, and the catalog all live there.
- Snapshot lifecycle = CR lifecycle, configurable. The "I deleted my `ReplicationSource` but my snapshots are still there" volsync confusion is structurally impossible.
- Cluster-scoped and namespace-scoped repositories are both first-class. Platform teams get a single-object shared repository; app teams get private repositories without cross-namespace RBAC plumbing.
- Multi-PVC, multi-namespace, one-repo: first-class.
- VolumeGroupSnapshot on by default; degrades loudly, never silently.
- No bash mover scripts.
6.2 Negative / trade-offs
- Seven CRDs to volsync's two: a discoverability cost. Mitigated by hiding `Maintenance` from the typical first-time user flow (the simple case in §5.1 doesn't reference it explicitly) and by overloading `Backup` to cover both the "operator made this" and "we found this in the repo" cases.
- `origin: discovered` `Backup` CRs add etcd load. Mitigated by `catalog.retain` bounds.
- Webhook resolution of `Restore.source.backupRef` means restoring a snapshot just outside the catalog window requires the raw-identity escape hatch. Documented.
- Kopia version pinning is a single choice baked into the operator image; mitigated by a `Repository.spec.kopiaImageOverride` escape hatch for advanced users.
- The default `deletionPolicy: Delete` on operator/manual backups is a sharp edge for users coming from volsync, where deleting a `ReplicationSource` is a safe operation. Documentation must lead with this difference and offer `defaultDeletionPolicy: Retain` on `BackupConfig` as the conservative migration default.
- "Manifest deleted ≠ storage reclaimed" asymmetry will generate support questions. Mitigated by exposing `lastContentReclaimedBytes` on `Maintenance` and by metrics, but it remains an unavoidable property of kopia.
- Coexists with volsync rather than supplanting it; users wanting rsync/syncthing keep volsync.
7. Deferred / open questions
These are real questions we've punted on. Each warrants its own ADR before implementation.
- Cron library implementation. Out of scope per the ADR header. Requirements include: deterministic jitter (derived from a stable seed), IANA TZ database, manual-trigger primitive, runtime schedule updates, missed-run policies on operator restart, ISC-style DST handling.
- `BackupWorkflow` / `dependsOn`. Backup→verify→cleanup pipelines (e.g., automatic restore-into-scratch verification) are a natural follow-up. The trigger schema does not preclude this. Likely v1alpha2.
- `RestoreSchedule`. Scheduled restore verification can be done with `CronJob` applying `Restore` CRs today; whether to first-class it is deferred. We will ship a documented example.
- Repository password rotation. `Repository.spec.encryption` is a sub-object so the surface exists; the flow (rolling write to all repo blobs, coordinated with maintenance) needs its own design.
- Cross-cluster restoration. Identity model accommodates per-cluster hostname prefixes; the operational surface (presenting one cluster's repo to another) is out of scope for v1alpha1.
- VolSync migration tooling. Likely a `kubectl` plugin that translates `ReplicationSource` + `ReplicationDestination` into `BackupConfig` + `BackupSchedule` + `Restore`. Separate ADR.
- Performance characterization at high CR counts. A workload backed up hourly with 14-day daily / 8-week weekly retention will hold ~50 `Backup` CRs per workload at steady state. That is manageable, but warrants benchmarks at 10k `Backup` CRs per namespace before GA.
- Discovered backup attribution. When a discovered snapshot's identity matches a known `BackupConfig`, should the operator place the discovered `Backup` in the `BackupConfig`'s namespace instead of the `Repository`'s? Improves locality but adds attribution complexity. Deferred.
8. References
- VolSync upstream: https://github.com/backube/volsync
- VolSync kopia fork (perfectra1n): https://github.com/perfectra1n/volsync
- Kopia mover PR: https://github.com/backube/volsync/pull/1723
- Trigger redesign proposal: https://github.com/backube/volsync/issues/1559
- Kopia: https://kopia.io/
- CloudNativePG (Backup/ScheduledBackup pattern): https://cloudnative-pg.io/documentation/current/backup/
- KEP-1495 AnyVolumeDataSource: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1495-volume-populators
- `kubernetes-csi/volume-data-source-validator`: https://github.com/kubernetes-csi/volume-data-source-validator
- `kubernetes-csi/lib-volume-populator`: https://github.com/kubernetes-csi/lib-volume-populator
Appendix A: Field-by-field comparison vs volsync
| Concern | volsync | this operator |
|---|---|---|
| Repo as a resource | Secret reference | Repository + ClusterRepository CRDs |
| Cluster-scoped shared repo | not expressible | ClusterRepository with allowedNamespaces |
| Triggering layers | one (trigger field on source) | three (BackupConfig / Backup / BackupSchedule) |
| Manual / external trigger | trigger.manual: <value> string-change | kubectl create backup (G17, #1559) |
| Snapshot lifecycle on CR delete | unaffected (G20) | deletionPolicy: Delete (default) / Retain / Orphan; finalizer-driven |
| Force-delete escape hatch | n/a | kopia.io/skip-snapshot-cleanup: "true" annotation |
| Discovered / foreign snapshots | exec into mover pod | Backup CR with origin: discovered; forced Retain |
| Multi-PVC | not supported | pvcSelector + groupBy: VolumeGroupSnapshot default |
| Multi-PVC consistency on unsupported drivers | n/a | fail loud (GroupSnapshotUnsupported); groupBy: None opts into per-PVC |
| First-sync skip | not supported (#627) | runOnCreate: false default |
| Cron jitter | not supported (#1421) | deterministic jitter + H substitution |
| Cron timezone | not supported (#702) | schedule.timezone |
| Schedule anchor | last-completion | wall-clock (CronJob-style) |
| Concurrency policy | implicit | concurrencyPolicy: Forbid default |
| BackoffLimit | hard-coded 8 | per-Backup failurePolicy.backoffLimit |
| Retention | restic forget flags | GFS on BackupConfig, operator prunes CRs; CR-driven exclusive |
| Snapshot as a K8s object | absent | Backup CR (operator-initiated, manual, or discovered) |
| Restore by snapshot reference | no — only restoreAsOf | source.backupRef |
| Restore on a fresh cluster against an existing repo | manual runbook | source.fromConfig resolves via identity |
| Missing snapshot at restore | silently succeeds with empty PVC | Fail default; Continue for fromConfig (deploy-or-restore) |
| Maintenance | embedded in mover pod | own Maintenance CRD |
| Maintenance ownership | implicit | explicit lease + status |
| Snapshot catalog | exec into pod | Backup CRs (origin: discovered), bounded materialization |
| Hooks | shell-string in mover | typed workloadExec / runJob / httpRequest |
| Policy passthrough | restic flags only | typed policy.* + extraArgs escape hatch |
| Per-PVC metrics | absent (#518) | first-class labels |
| Stale metrics | observed (#1194) | metric-per-CR, cleared on delete |
| Content reclamation visibility | absent | Maintenance.status.*.lastContentReclaimedBytes + metric |
| Mover image default tag | :latest | digest-pinned per operator release |
| Zombie mover pods | observed (fork #8) | reaped on terminal-status reconcile |
| Lost+found root files | break unprivileged restore (#1033) | skipped by default |
| Restore target modes | destinationPVC (in-place) OR populator | target.pvcRef / target.pvc / passive (populator) — uniform |
| Populator dependency | requires reading the docs to know | clarified: works on any 1.24+ cluster; volume-data-source-validator is an optional UX nicety |
ADR-0001: A Kopia-Native Backup Operator for Kubernetes
- Status: Proposed
- Date: 2026-05-24
- Inspired by: `backube/volsync` and the kopia fork `perfectra1n/volsync` (especially PR `backube/volsync#1723` and the trigger-redesign proposal `backube/volsync#1559`). The triggering model also draws on CloudNativePG (`Cluster`/`ScheduledBackup`/`Backup`) and Tekton (`Task`/`TaskRun`).
Scope: this ADR covers CRD shape, user experience, and high-level design choices. It deliberately does not specify Go package layout, controller-runtime indexes, leader-election lease IDs, or other implementation mechanics — those belong to follow-up ADRs once the API surface is agreed.
1. Context
VolSync is the de-facto Kubernetes-native mover for PVCs. Its design is mature and battle-tested, but it has accreted around restic's model. As soon as you try to add a non-restic mover (kopia, rustic, borg, …) several deep design choices push back. The community fork perfectra1n/volsync proves out a kopia mover and ships a usable image — but its PR has been open ~13 months without merging, the upstream maintainers are capacity-constrained, and many users have switched to running the fork in production.
The fork's existence and the volume of feature requests around kopia/restic locking, multi-PVC backup, scheduling jitter, restore UX, trigger separation, and "stop running on apply" suggest something stronger than "land kopia in volsync" is warranted. A kopia-native operator can:
- Drop the multi-mover abstraction entirely. Kopia is the only mover, so every CRD field can be expressive without leaking through a generic shape.
- Make a repository a first-class Kubernetes resource. Kopia repos are designed to be shared across many writers — a fact volsync cannot express cleanly.
- Separate recipe, invocation, and schedule so backups can be triggered by any source (cron, `kubectl create`, Argo Events, button-in-Grafana). Volsync's `trigger` field couples all three.
- Use kopia's native identity model (`username@hostname:path`) deliberately rather than as an accident of `metadata.name`/`metadata.namespace`.
- Treat `kopia maintenance` and snapshot lifecycle as first-class operator concerns rather than retrofits.
- Surface kopia's snapshot catalog through CRDs so restore is "browse and reference," not "construct a `restoreAsOf` timestamp and hope."
- Address the long backlog of papercuts as design decisions, not bug fixes.
We refer to the project as kopia-operator in this document; final naming is out of scope. The API group is kopia.io with initial version v1alpha1.
1.1 The most important gaps we are addressing
| # | Gap | volsync refs |
|---|---|---|
| G1 | Repository is not a Kubernetes resource; cannot be shared/reused cleanly | implicit; perfectra1n CRD shape |
| G2 | One ReplicationSource = one PVC | #1115, #1116, #320 |
| G3 | First reconcile triggers an immediate backup, no GitOps-friendly "skip first run" | #627 |
| G4 | No cron jitter / H substitution, no timezone | #1421, #702 |
| G5 | Restic repo locking / piling-up jobs | #1042, #1429, #646 |
| G6 | No retry-limit / backoffLimit override | #1228, #1042 |
| G7 | Restore proceeds with empty PVC if no snapshot found | #1211 |
| G8 | Snapshot selection is restic-format restoreAsOf only; no browse | #7, #1211 |
| G9 | latestImage always wins — no immutable restore source | disc #1115 |
| G10 | Volume populator + Direct copyMethod incompatibility | disc #1115, #1129 |
| G11 | Maintenance ownership is implicit & runs in the same pod as backup | perfectra1n fork redesigned this three times |
| G12 | Policy passthrough is brittle: every kopia knob needs CRD/jq script changes | fork #13, #23 |
| G13 | Snapshot actions run in mover, not workload | fork #22 |
| G14 | OOMs unpredictable; no resource guidance | #626, #707, #1228 |
| G15 | Mover image is :latest by default | volsync restic/builder.go:42 |
| G16 | Restricted PSA / OpenShift SCC / unprivileged-mode lost+found papercuts | #367, #1033, #1889, #1430 |
| G17 | Trigger semantics are baked into the source CR — no manual/external trigger path | #1559 |
| G18 | Mover-pod lifecycle (zombie pods, stuck jobs) | fork #8, volsync #1415 |
| G19 | Maintainers' explicit door-closing on new movers | #1743, #1029, #320 |
2. Decision
2.1 Topology
Six CRDs in kopia.io/v1alpha1, all namespaced:
| CRD | Layer | Purpose |
|---|---|---|
| `Repository` | Storage | A kopia repository: credentials, backend, encryption, optional catalog-materialization bounds. Many BackupConfigs/Restores reference one. |
| `BackupConfig` | Recipe | What to back up: PVC selector, identity, retention, policy, hooks. Idempotent: it doesn't run anything on its own. |
| `Backup` | Invocation + Catalog | A single kopia snapshot as a Kubernetes object. Created by BackupSchedule, kubectl create, or any other trigger source. Also materialized by the operator from the kopia catalog for snapshots it didn't produce (foreign or pre-install). |
| `BackupSchedule` | Cron | When it runs: cron (with jitter + timezone) + configRef. Creates Backup CRs. |
| `Restore` | Operation | A restore from a snapshot/identity to a PVC. Used directly, or referenced by PVC.spec.dataSourceRef. |
| `Maintenance` | Lifecycle | One per Repository: schedules kopia maintenance run quick + full, manages ownership lease. |
The three-layer split (recipe / invocation / schedule) for backups is the deliberate response to volsync #1559. It means:
- A `Backup` can be created from anywhere: `kubectl create`, Argo Events, a Tekton pipeline, a webhook handler.
- A `BackupSchedule` is just one source of `Backup` CRs. Removing or pausing a schedule does not affect already-running or already-completed runs.
- A `BackupConfig` change applies to subsequent invocations; the operator snapshots resolved values into each `Backup.status.resolved...` for traceability.
Backup is also the single canonical representation of a kopia snapshot — both ones we produced and ones we discover in the repo. Two retention drivers cover the lifecycle:
- `BackupSchedule.spec.successfulJobsHistoryLimit` GCs schedule-spawned `Backup`s.
- `Repository.spec.catalog.retain` GCs `origin: discovered` `Backup`s, bounding etcd footprint for large repos.
- Manual `Backup` CRs are user-owned and never auto-GC'd.
Dedup key is (Repository.UID, kopiaSnapshotID) — the operator will not create a discovered Backup for a snapshot already represented by an operator-initiated one.
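A minimal sketch of that dedup check, assuming the catalog scan yields plain snapshot IDs and that existing `Backup` CRs can be reduced to `(repository UID, snapshot ID)` pairs. `discovered_to_materialize` is a hypothetical helper, not part of the operator.

```python
def discovered_to_materialize(repo_uid, catalog_snapshot_ids, existing_backups):
    """Return snapshot IDs that still need a discovered Backup CR.

    existing_backups: iterable of (repository_uid, kopia_snapshot_id) pairs
    already represented by Backup CRs of any origin.
    """
    known = set(existing_backups)
    return [sid for sid in catalog_snapshot_ids if (repo_uid, sid) not in known]


existing = [("uid-1", "k1f1ec0a8"), ("uid-2", "k9c2a1f00")]
# k1f1ec0a8 is already represented for uid-1, so only k9c2a1f00 is returned.
print(discovered_to_materialize("uid-1", ["k1f1ec0a8", "k9c2a1f00"], existing))
```

Keying on the repository UID as well as the snapshot ID means the same snapshot ID in a different repository is still treated as new.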
Restore stays as a single CR (it's an operation, not a recurring thing). For the dataSourceRef-driven populator pattern, a Restore is left in passive mode (no target) and consumed by zero-or-more PVCs.
2.2 Anchoring principles
- Repositories are objects. Identity, lifecycle, and maintenance hang off them.
- Triggering is decoupled. `BackupConfig` says what; `Backup` says that; `BackupSchedule` says when. Any of the three can be authored or automated independently.
- A `Backup` is a kopia snapshot. Operator-initiated, manually-applied, and discovered snapshots are all the same kind.
- Restores are explicit. No silent "empty PVC because no snapshots existed yet" by default. The "deploy-or-restore" GitOps pattern is opt-in via a specific source mode + `onMissingSnapshot: Continue`.
- Maintenance is a first-class lifecycle concern, with its own CRD and explicit ownership lease.
- The mover is a thin shim. A Go-native controller invokes `kopia --json` and parses results. No 2,500-line bash scripts. The image carries `kopia` and nothing else.
- Validation is webhook-enforced. Mutually exclusive fields, missing repository references, and malformed schedules are rejected at admission.
- Identity is explicit and overridable. Defaults derive from object name/namespace; every component is overridable; the resolved identity always appears in `status`.
- Forward-compatible by construction. Every credential, policy, and rotation surface is a sub-object, so future fields slot in without API breakage (see §4.10).
2.3 Where Backup CRs live
| Origin | Namespace |
|---|---|
| `operator` (created by BackupSchedule) | The BackupConfig's namespace (so the owning team sees their backups with kubectl get backup -n <team>). |
| `manual` (created by kubectl create or external automation) | Whichever namespace the user applies it to. The configRef may cross namespaces, subject to RBAC. |
| `discovered` (materialized from the kopia catalog) | The Repository's namespace. The operator has no reliable way to attribute a foreign snapshot to a BackupConfig, so it stays with the repo. |
Restore.spec.source.backupRef carries { name, namespace } for cross-namespace references.
3. CRD Design
3.1 Repository
Owns credentials, encryption, and repository-wide settings. Catalog materialization for discovered Backup CRs is configured here.
apiVersion: kopia.io/v1alpha1
kind: Repository
metadata:
name: nas-primary
namespace: backups
spec:
# Exactly one backend block. Webhook-enforced.
backend:
s3:
bucket: my-backups
prefix: clusters/prod/
endpoint: s3.us-east-1.amazonaws.com
region: us-east-1
auth:
secretRef:
name: nas-primary-creds # keys: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, ...
# Optional advanced auth — workloadIdentity supported but not the default.
# workloadIdentity:
# serviceAccountName: kopia-s3
tls:
caBundleRef:
configMapName: corp-ca
key: ca.crt
insecureSkipVerify: false
encryption:
passwordSecretRef: # always a Secret ref; never inline
name: nas-primary-creds
key: KOPIA_PASSWORD
# Future fields (rotation, previousPasswords, ...) slot in here.
create:
enabled: true # if repo missing, create it
encryption: AES256-GCM-HMAC-SHA256
splitter: DYNAMIC-4M-BUZHASH
hash: BLAKE3-256
cacheDefaults: # inherited by Backup/Restore unless overridden
capacity: 8Gi
storageClassName: fast-ssd
metadataCacheSizeMB: 5000
contentCacheSizeMB: 2000
catalog: # bounds materialization of `origin: discovered` Backup CRs
retain:
perIdentity: 100 # most recent N per username@hostname:path
maxAgeDays: 90 # nothing older than this gets a Backup CR
refreshInterval: 5m
# Older snapshots remain in kopia; restorable via Restore.source.identity.snapshotID
status:
phase: Ready # Pending | Initializing | Ready | Degraded | Failed
observedGeneration: 7
uniqueID: "fb6e...c41a" # kopia repo unique ID
conditions:
- type: Connected
status: "True"
reason: ConnectFromConfig
- type: MaintenanceOwned
status: "True"
message: "kopia-operator/nas-primary"
storageStats:
snapshotCount: 1284
totalSize: 412Gi
lastObservedAt: 2026-05-24T17:00:01Z
catalog:
discoveredBackupCount: 412 # how many Backup CRs materialized from the catalog scan
lastRefreshAt: 2026-05-24T17:01:11Z
Why: addresses G1 (repo as a resource), G15 (digest pinning belongs on the operator image, not embedded per recipe), and provides the catalog-bounds knob that keeps Backup CRs from blowing up etcd while still giving the K8s-native view of kopia history. encryption is a sub-object so future rotation fields fit without API breakage (§4.10).
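How the `catalog.retain` bounds could select snapshots for materialization is sketched below. It assumes both bounds apply together (a snapshot must be within `maxAgeDays` and among the newest `perIdentity` for its identity); the ADR does not spell out the combination rule, and `select_for_catalog` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def select_for_catalog(snapshots, per_identity, max_age_days, now):
    """Pick snapshots that would get a discovered Backup CR.

    snapshots: list of dicts with "identity" (username@hostname:path),
    "id", and "start_time" (timezone-aware datetime).
    """
    cutoff = now - timedelta(days=max_age_days)
    by_identity = {}
    for snap in snapshots:
        if snap["start_time"] >= cutoff:  # maxAgeDays: drop anything older
            by_identity.setdefault(snap["identity"], []).append(snap)
    selected = []
    for group in by_identity.values():
        group.sort(key=lambda s: s["start_time"], reverse=True)
        selected.extend(group[:per_identity])  # perIdentity: newest N only
    return selected


now = datetime(2026, 5, 24, tzinfo=timezone.utc)
snaps = [
    {"identity": "postgres-data@billing:/data", "id": "new", "start_time": now - timedelta(days=1)},
    {"identity": "postgres-data@billing:/data", "id": "ancient", "start_time": now - timedelta(days=100)},
]
print([s["id"] for s in select_for_catalog(snaps, 100, 90, now)])
```

Snapshots that fall outside the bounds stay in kopia and remain restorable through the raw-identity source mode.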
3.2 BackupConfig
The recipe. Idempotent. Apply once; reference from many Backups or one BackupSchedule.
apiVersion: kopia.io/v1alpha1
kind: BackupConfig
metadata:
name: postgres-data
namespace: billing
spec:
repository:
name: nas-primary
namespace: backups # cross-ns ref; RBAC-gated
# Identity — what kopia sees. Defaults shown.
identity:
username: "postgres-data" # default: <BackupConfig.metadata.name>
hostname: "billing" # default: <BackupConfig.metadata.namespace>
# Sources — what to back up.
sources:
- pvc: { name: postgres-data }
sourcePathOverride: /data # what kopia records (default: /pvc/<name>)
# Or a selector for multi-PVC:
# - pvcSelector:
# namespaceSelector: { matchNames: [billing, billing-staging] }
# labelSelector: { matchLabels: { backup: include } }
# sourcePathStrategy: PVCName # PVCName | PVCNamespacedName
copyMethod: Snapshot # Snapshot (default, PiT) | Clone | Direct
volumeSnapshotClassName: csi-snap-class
groupBy: VolumeGroupSnapshot # default for multi-PVC sources; None opts into per-PVC
retention:
keepLatest: 10
keepHourly: 24
keepDaily: 14
keepWeekly: 8
keepMonthly: 12
keepAnnual: 5
policy: # typed fields — not opaque JSON parsed by jq
compression:
compressor: zstd
neverCompress: ["*.zip", "*.gz", "*.mp4"]
splitter: DYNAMIC-4M-BUZHASH
ignore:
paths: ["*.tmp", "*/cache/*", "lost+found"]
cacheDirs: true # honor CACHEDIR.TAG
ignoreIdenticalSnapshots: true # fork issue #13
extraArgs: [] # escape hatch for kopia flags we don't model yet
hooks: # G13 — runs in the workload, not the mover
beforeSnapshot:
- workloadExec:
podSelector: { matchLabels: { app: postgres } }
container: postgres
command: ["pg_start_backup", "snap"]
timeout: 2m
afterSnapshot:
- workloadExec:
podSelector: { matchLabels: { app: postgres } }
container: postgres
command: ["pg_stop_backup"]
timeout: 2m
mover: # per-recipe overrides
resources:
requests: { cpu: 250m, memory: 512Mi }
limits: { cpu: "2", memory: 4Gi }
cache:
capacity: 16Gi
storageClassName: fast-ssd
securityContext: {} # override; default: nonRoot uid 65534
# privilegedMode: true # opt-in, namespace-gated; preserves UID/GID on restore
# inheritSecurityContextFrom: # opt-in: copy SC from a live workload pod
# podSelector: { matchLabels: { app: postgres } }
status:
resolved: # what would be passed to kopia
identity:
username: "postgres-data"
hostname: "billing"
sources:
- pvc: billing/postgres-data
sourcePath: /data
conditions:
- type: RepositoryReachable
status: "True"
- type: GroupSnapshotSupported
status: "True"
Why: addresses G2 (selector + VolumeGroupSnapshot default), G12 (typed policy + escape hatch), G13 (hook types), G14 (explicit resource defaults), G16 (security-context controls without forcing privileged-by-default). The identity sub-object makes the second-biggest perfectra1n papercut (fork #7) impossible.
3.3 Backup
A single kopia snapshot as a Kubernetes object. Three origins:
- `operator`: created by a `BackupSchedule`. Spec has `configRef`; lives in the `BackupConfig`'s namespace.
- `manual`: created by `kubectl create` or external automation. Spec has `configRef`; lives wherever the user applied it.
- `discovered`: materialized by the operator's catalog scan for snapshots it didn't produce. Spec is empty; lives in the `Repository`'s namespace.
apiVersion: kopia.io/v1alpha1
kind: Backup
metadata:
name: postgres-data-20260524-021300
namespace: billing
labels:
# Operator-managed labels — canonical values live in status; these mirror
# for kubectl-selectability.
kopia.io/repository: nas-primary
kopia.io/backup-config: postgres-data
kopia.io/origin: operator
kopia.io/identity-hash: "a3f1..."
spec:
# Operator-initiated and manual: configRef + optional overrides.
# Discovered: spec is empty/absent.
configRef: { name: postgres-data }
tags:
reason: "scheduled-nightly"
# parameters: # optional per-run overrides on the recipe
# compressionOverride: none
failurePolicy: # G6 — per-run, not hard-coded
backoffLimit: 2
activeDeadlineSeconds: 7200
status:
phase: Succeeded # Pending | Running | Succeeded | Failed | Discovered
origin: operator # operator | manual | discovered — canonical
snapshot: # the kopia artifact
kopiaSnapshotID: k1f1ec0a8
identity:
username: "postgres-data"
hostname: "billing"
sourcePath: /data
timing:
startTime: 2026-05-24T02:13:00Z
endTime: 2026-05-24T02:18:42Z
durationSeconds: 342
stats: # populated from kopia's JSON output
sizeBytes: 4321098765
bytesNew: 12345678
filesNew: 1233
filesModified: 22
filesUnchanged: 998111
job: # operator/manual only; absent for discovered
name: backup-postgres-data-20260524-021300
attempts: 1
resolved: # frozen recipe values at run time (operator/manual)
repository: { name: nas-primary, namespace: backups }
sources:
- pvc: billing/postgres-data
sourcePath: /data
conditions:
- type: SourcesQuiesced
status: "True"
- type: SnapshotCreated
status: "True"
logTail: | # capped at ~4KB; full logs in the Job pod
Snapshot created: k1f1ec0a8
Total bytes: 4321098765
kubectl get backup shows everything — runs in flight, historical successes, failed attempts, and the discovered catalog — distinguished by the kopia.io/origin label and status.phase.
Why: addresses G17 (invocations as first-class, any trigger source), folds in the catalog representation cleanly, and means restores reference one kind of thing. Logs are bounded; full logs live in the Job pod where users expect them. Failed Backup CRs are durable evidence — failedJobsHistoryLimit on the schedule controls how many we keep.
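The history limits can be read as "keep the newest N per terminal phase". A minimal sketch under that assumption follows; `backups_to_prune` is a hypothetical helper, and the interaction with each Backup's `deletionPolicy` is deliberately left out.

```python
def backups_to_prune(backups, successful_limit, failed_limit):
    """Return names of schedule-spawned Backups beyond the history limits.

    backups: list of dicts with "name", "phase", and "start_time" (sortable).
    Phases other than Succeeded/Failed (e.g. Running) are never pruned here.
    """
    prune = []
    for phase, limit in (("Succeeded", successful_limit), ("Failed", failed_limit)):
        finished = sorted(
            (b for b in backups if b["phase"] == phase),
            key=lambda b: b["start_time"],
            reverse=True,
        )
        prune.extend(b["name"] for b in finished[limit:])  # keep the newest `limit`
    return prune


history = [
    {"name": "s1", "phase": "Succeeded", "start_time": 1},
    {"name": "s2", "phase": "Succeeded", "start_time": 2},
    {"name": "f1", "phase": "Failed", "start_time": 3},
    {"name": "r1", "phase": "Running", "start_time": 4},
]
print(backups_to_prune(history, successful_limit=1, failed_limit=3))
```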
3.4 BackupSchedule
Creates Backup CRs on a schedule in the BackupConfig's namespace.
apiVersion: kopia.io/v1alpha1
kind: BackupSchedule
metadata:
name: postgres-data-nightly
namespace: billing
spec:
configRef:
name: postgres-data
schedule:
cron: "H 2 * * *" # G4 — Jenkins-style 'H' substitution
jitter: 30m
timezone: "America/Los_Angeles"
runOnCreate: false # G3 — GitOps-friendly default
suspend: false
concurrencyPolicy: Forbid # Forbid | Allow | Replace — G18
startingDeadlineSeconds: 600
successfulJobsHistoryLimit: 10 # GC bound for origin: operator Backups from this schedule
failedJobsHistoryLimit: 3
status:
lastSchedule:
scheduledAt: 2026-05-24T02:13:00Z # cron + jitter, pinned (predictable for alerting)
backupRef: { name: postgres-data-20260524-021300 }
nextSchedule:
at: 2026-05-25T02:21:00Z
lastSuccessfulSchedule:
at: 2026-05-24T02:13:00Z
backupRef: { name: postgres-data-20260524-021300 }
consecutiveFailures: 0
conditions:
- type: ConfigResolvable
status: "True"
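Why: mirrors CronJob semantics for concurrency policy, starting deadline, and history limits, and adds jitter, H substitution, and a timezone on top (G4, G18). Schedule anchoring is wall-clock (cron(now)), not cron(lastSyncTime), which fixes volsync's drift behavior. The pinned scheduledAt lets ops alerts say "you missed the 02:13 slot" without ambiguity.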
3.5 Restore
A restore from a Backup (or raw kopia identity) to a PVC.
apiVersion: kopia.io/v1alpha1
kind: Restore
metadata:
name: postgres-restore-2026-05-23
namespace: billing
spec:
# Optional. Derived from `source` when omitted (the Backup CR / BackupConfig CR
# knows its Repository). Required only with `source.identity`.
# repository: { name: nas-primary, namespace: backups }
# Exactly one of the following. Webhook-enforced.
source:
# Preferred: a Backup CR (operator-initiated, manual, or discovered — all same kind).
backupRef: { name: postgres-data-20260524-021300, namespace: billing }
# Or a BackupConfig CR — resolves via identity against the repo, even if no Backup
# CR has ever been created in this cluster (deploy-or-restore on a fresh cluster
# against an existing repo).
# fromConfig:
# name: postgres-data
# asOf: 2026-05-23T20:00:00Z
# offset: 0 # 0 = latest, 1 = previous, ...
# Or a raw kopia identity (works for foreign writers or snapshots that have
# aged out of the K8s-side catalog window).
# identity:
# username: postgres-data
# hostname: billing
# sourcePath: /data
# snapshotID: k1f1ec0a8 # or asOf / offset
# spec.repository is REQUIRED when using `identity`.
# Optional. Three modes.
target:
# Mode 1: operator creates the PVC.
pvc:
name: postgres-data-restored
storageClassName: fast-ssd
capacity: 100Gi
accessModes: [ReadWriteOnce]
# Mode 2: write into an existing PVC.
# pvcRef: { name: postgres-data-restored }
# Mode 3: omit `target` entirely — passive. A PVC with
# spec.dataSourceRef -> this Restore kicks off the populator handshake.
options:
enableFileDeletion: false
ignorePermissionErrors: true
writeFilesAtomically: true
policy:
onMissingSnapshot: Fail # default for explicit sources (backupRef/identity)
# For source.fromConfig the default is Continue (see §4.4 for the deploy-or-restore pattern).
waitTimeout: 5m
status:
phase: Restoring # Pending | Resolving | Restoring | Completed | Failed
resolved: # pinned at admission
backupRef: { name: postgres-data-20260524-021300, namespace: billing }
repository: { name: nas-primary, namespace: backups }
pinnedAt: 2026-05-24T17:33:11Z
identity:
username: postgres-data
hostname: billing
sourcePath: /data
target: # what the operator is writing into
pvcPrime: pvc-prime-9f8e2c1b # populator handshake (passive / pvc-create modes)
pvcRef: { name: postgres-data-restored }
timing:
startTime: 2026-05-24T17:33:14Z
progress:
bytesRestored: 8123456789
filesRestored: 998111
Why: addresses G7 (fail-closed defaults), G8/G9 (admission-time resolution, no drift on re-apply), G10 (single restore path covers populator and in-place uniformly). Three source modes cover the spectrum: K8s-native reference (backupRef), recipe-driven (fromConfig — the GitOps deploy-or-restore pattern), and raw kopia identity (foreign writers, aged-out catalog). spec.repository is derivable for the first two; required only when raw identity is the source.
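The two admission rules called out in the manifest comments (exactly one source mode; `spec.repository` required with `identity`) could be checked along these lines. This is a sketch over a plain dict, not the operator's webhook; `validate_restore_spec` is a hypothetical name.

```python
SOURCE_MODES = ("backupRef", "fromConfig", "identity")


def validate_restore_spec(spec: dict) -> list:
    """Return admission errors for a Restore spec (empty list means accepted)."""
    errors = []
    source = spec.get("source", {})
    chosen = [mode for mode in SOURCE_MODES if mode in source]
    if len(chosen) != 1:
        errors.append(f"spec.source must set exactly one of {SOURCE_MODES}, got {chosen}")
    if "identity" in chosen and "repository" not in spec:
        errors.append("spec.repository is required when spec.source.identity is used")
    return errors


ok = {"source": {"backupRef": {"name": "postgres-data-20260524-021300", "namespace": "billing"}}}
bad = {"source": {"identity": {"username": "postgres-data", "hostname": "billing"}}}
print(validate_restore_spec(ok))
print(validate_restore_spec(bad))
```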
3.6 Maintenance
apiVersion: kopia.io/v1alpha1
kind: Maintenance
metadata: { name: nas-primary, namespace: backups }
spec:
repository: { name: nas-primary }
schedule:
quick: { cron: "0 */6 * * *", jitter: 30m }
full: { cron: "0 3 * * 0", jitter: 1h }
timezone: UTC
ownership:
owner: "kopia-operator/nas-primary"
takeoverPolicy: PromptCondition # Never | PromptCondition | Force
mover:
resources: { requests: { cpu: 250m, memory: 1Gi }, limits: { cpu: "2", memory: 4Gi } }
failurePolicy:
backoffLimit: 1
activeDeadlineSeconds: 14400
status:
ownership:
owner: "kopia-operator/nas-primary"
claimedAt: 2026-05-12T08:14:02Z
quick:
lastRunAt: 2026-05-24T12:00:11Z
nextScheduledAt: 2026-05-24T18:00:00Z
consecutiveFailures: 0
full:
lastRunAt: 2026-05-19T03:01:42Z
nextScheduledAt: 2026-05-26T03:00:00Z
consecutiveFailures: 0
conditions:
- type: OwnershipClaimed
status: "True"
Why: at most one Maintenance per Repository (webhook-enforced) — kills the perfectra1n cross-namespace first-writer-wins race by making the conflict unrepresentable.
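A minimal sketch of that uniqueness check, assuming the webhook can list existing `Maintenance` objects and that a repository is identified by a `(namespace, name)` pair. `admit_maintenance` is a hypothetical name.

```python
def admit_maintenance(new, existing):
    """Reject a Maintenance if a different one already targets the same Repository.

    new / existing items: dicts with "name", "namespace", and "repository",
    where "repository" is a (namespace, name) tuple.
    """
    for other in existing:
        same_object = (other["namespace"], other["name"]) == (new["namespace"], new["name"])
        if not same_object and other["repository"] == new["repository"]:
            owner = f"{other['namespace']}/{other['name']}"
            return False, f"Repository {new['repository']} is already maintained by {owner}"
    return True, ""


current = [{"name": "nas-primary", "namespace": "backups", "repository": ("backups", "nas-primary")}]
second = {"name": "nas-primary-2", "namespace": "backups", "repository": ("backups", "nas-primary")}
print(admit_maintenance(second, current))
```

Updates to the existing object pass because the check skips the object being admitted.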
4. Key behaviors
4.1 Scheduling
- CronJob-style wall-clock anchoring (`cron(now)`), not last-completion anchoring. Fixes volsync's drift.
- `jitter` derives deterministically from `BackupSchedule.UID + scheduledAt` so HA operator replicas compute the same fire time.
- `cron: "H * * * *"` literal `H` substitution; result pinned in `status.lastSchedule.scheduledAt`.
- `runOnCreate: false` is the default (G3).
- `concurrencyPolicy: Forbid` is the default; skipped runs surface a condition rather than silently piling up (G5/G18).
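Deterministic jitter can be sketched as a hash of the schedule UID and the nominal cron slot. The hash function (SHA-256), the seed layout, and the `[0, jitter)` window are assumptions made for illustration; the ADR leaves the cron library to a follow-up. `jittered_fire_time` is a hypothetical name.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def jittered_fire_time(schedule_uid: str, nominal: datetime, jitter: timedelta) -> datetime:
    """Offset the nominal cron slot by a stable pseudo-random delay in [0, jitter)."""
    seed = f"{schedule_uid}/{nominal.isoformat()}".encode()
    digest = hashlib.sha256(seed).digest()
    window = int(jitter.total_seconds())
    offset = int.from_bytes(digest[:8], "big") % window if window > 0 else 0
    return nominal + timedelta(seconds=offset)


slot = datetime(2026, 5, 24, 2, 0, tzinfo=timezone.utc)
first = jittered_fire_time("hypothetical-uid", slot, timedelta(minutes=30))
second = jittered_fire_time("hypothetical-uid", slot, timedelta(minutes=30))
print(first == second)  # every replica derives the same fire time from the same inputs
```

Because the offset depends only on the inputs, no coordination between replicas is needed to agree on the fire time.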
4.2 Identity model
For a BackupConfig C in namespace N backing up PVC P:
- username defaults to C.metadata.name; hostname defaults to N.
- sourcePath defaults to /pvc/<P>; sourcePathStrategy: PVCNamespacedName is available for multi-namespace selectors.
- Resolved identity always appears in BackupConfig.status.resolved.identity and Backup.status.identity.
This is the part where a kopia-native operator can do better than volsync's accidental design — identity is the API, not an internal detail.
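The defaulting rules in this section can be sketched as a small pure function. The struct and field names below are hypothetical (in particular, treating sourcePath as an override alongside username and hostname is an assumption); only the default values come from the text.

```rust
/// Optional overrides, as they might appear under `spec.identity`
/// (hypothetical shape for this example).
#[derive(Default)]
struct IdentityOverrides {
    username: Option<String>,
    hostname: Option<String>,
    source_path: Option<String>,
}

/// The identity a snapshot is written under.
#[derive(Debug, PartialEq)]
struct ResolvedIdentity {
    username: String,
    hostname: String,
    source_path: String,
}

/// Applies the documented defaults: username = BackupConfig name,
/// hostname = namespace, sourcePath = /pvc/<PVC name>.
fn resolve_identity(
    config_name: &str,
    namespace: &str,
    pvc_name: &str,
    overrides: &IdentityOverrides,
) -> ResolvedIdentity {
    ResolvedIdentity {
        username: overrides
            .username
            .clone()
            .unwrap_or_else(|| config_name.to_string()),
        hostname: overrides
            .hostname
            .clone()
            .unwrap_or_else(|| namespace.to_string()),
        source_path: overrides
            .source_path
            .clone()
            .unwrap_or_else(|| format!("/pvc/{}", pvc_name)),
    }
}
```

Keeping resolution free of cluster access is what would let the same result be published in status and reused by a later restore.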
4.3 Repository sharing
Many BackupConfigs point at one Repository. Each writes under its own identity, so snapshots never collide. The repo is created lazily on first connect failure; the race is mediated by kopia's own object-store guarantees plus a per-repo lease in the operator. The RESTIC_HOST="volsync" anti-pattern doesn't apply.
4.4 Restore resolution & semantics
Restore.spec.source resolution at admission:
| Source mode | Resolution | spec.repository required? | Default onMissingSnapshot |
|---|---|---|---|
| backupRef | Look up the Backup CR; derive repository from it | No | Fail |
| fromConfig | Resolve identity from BackupConfig, query repo directly | No (derived from the BackupConfig) | Continue |
| identity | Direct kopia query | Yes | Fail |
fromConfig + Continue is the deploy-or-restore pattern: apply Repository + BackupConfig + BackupSchedule + Restore + workload PVC together. Fresh cluster against an existing repo → PVC restored from latest. Fresh cluster against empty repo → PVC binds empty, BackupSchedule starts producing Backups under the same identity, and a future redeploy restores from there. No manifest changes between the two cases.
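One way to picture the resolution table is as an enum with one variant per source mode, where the match is exhaustive so a new mode cannot be added without deciding its defaults. The type and function names are illustrative assumptions; the real CRD types may be shaped differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum OnMissingSnapshot {
    Fail,
    Continue,
}

/// One variant per source mode from the table (field sets are illustrative).
enum RestoreSource {
    BackupRef { name: String },
    FromConfig { name: String, offset: u32 },
    Identity { username: String, hostname: String, source_path: String },
}

impl RestoreSource {
    /// Only the raw-identity mode cannot derive its repository.
    fn requires_repository(&self) -> bool {
        matches!(self, RestoreSource::Identity { .. })
    }

    /// Default policy per mode; fromConfig alone defaults to Continue.
    fn default_on_missing(&self) -> OnMissingSnapshot {
        match self {
            RestoreSource::BackupRef { .. } | RestoreSource::Identity { .. } => {
                OnMissingSnapshot::Fail
            }
            RestoreSource::FromConfig { .. } => OnMissingSnapshot::Continue,
        }
    }
}

/// Admission-style check: reject a raw-identity source without a repository.
fn validate(source: &RestoreSource, repository: Option<&str>) -> Result<(), String> {
    if source.requires_repository() && repository.is_none() {
        return Err("spec.repository is required when source is a raw identity".to_string());
    }
    Ok(())
}
```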
writeFilesAtomically: true is the default. ignorePermissionErrors: true is the default (and surfaces a condition if any errors occurred — non-silent).
4.5 Volume populator
To clarify the field's status on modern Kubernetes:
- PVC.spec.dataSourceRef has been usable by default since 1.24, when the AnyVolumeDataSource feature gate graduated to beta (default-on).
- A populator controller (this operator) watches PVCs whose dataSourceRef references its kind and runs the pvc-prime + claimRef-rebind handshake.
- kubernetes-csi/volume-data-source-validator (which ships the populator.storage.k8s.io/VolumePopulator CRD) is optional. Without it, PVCs that mistype their populator ref hang Pending. With it, they're rejected at admission. The actual data-moving machinery works either way.
Our position: the populator path works on any cluster ≥ 1.24 without installing anything extra. If the VolumePopulator CRD is present at operator startup, we register ourselves for the better UX; if absent, we log it and carry on. No hard dependency.
This addresses G10 by making the populator path uniform (passive Restore) and never gating it on copy-method.
4.6 Hooks
hooks.beforeSnapshot[] / hooks.afterSnapshot[] accept one of:
- workloadExec: kubectl exec-style into a matched workload pod/container (the default and most-requested form, fork #22).
- runJob: { jobSpec: ... } takes a full JobSpec to run as a one-shot Job (k8up-style PreBackupPod analog). Named runJob to make the materialization explicit.
- httpRequest: typed POST to a URL for cross-system orchestration.
Hook failures abort the backup by default; continueOnFailure: true is opt-in per hook.
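The abort-by-default rule can be sketched as a loop over typed hooks with a caller-supplied executor. Everything named here (HookAction, Hook, run_hooks, the field sets) is hypothetical; the sketch only demonstrates the control flow, not how a hook is actually executed.

```rust
/// The three hook forms listed above, with illustrative payloads.
enum HookAction {
    WorkloadExec { container: String, command: Vec<String> },
    RunJob { job_spec_json: String },
    HttpRequest { url: String },
}

struct Hook {
    action: HookAction,
    /// Opt-in per hook; false means a failure aborts the backup.
    continue_on_failure: bool,
}

/// Runs hooks in order. Returns the warnings collected from tolerated
/// failures, or an error at the first failure of a hook that did not
/// opt in to `continue_on_failure`.
fn run_hooks<F>(hooks: &[Hook], mut exec: F) -> Result<Vec<String>, String>
where
    F: FnMut(&HookAction) -> Result<(), String>,
{
    let mut warnings = Vec::new();
    for (i, hook) in hooks.iter().enumerate() {
        if let Err(e) = exec(&hook.action) {
            if hook.continue_on_failure {
                warnings.push(format!("hook {} failed (continuing): {}", i, e));
            } else {
                return Err(format!("hook {} failed, aborting backup: {}", i, e));
            }
        }
    }
    Ok(warnings)
}
```

Returning the tolerated failures rather than dropping them keeps the "non-silent" property: a caller could surface them as a condition on the Backup.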
4.7 Multi-PVC consistency
groupBy: VolumeGroupSnapshot is the default for multi-PVC sources. If the chosen volumeSnapshotClass's driver doesn't support VGS, the BackupConfig reports a GroupSnapshotUnsupported condition and refuses to run. Silently falling back to per-PVC snapshots would mean inconsistent backups — the same data-integrity hazard as #1211. To intentionally accept per-PVC snapshots, set groupBy: None explicitly.
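The fail-loud rule amounts to a decision table with no fallback arm. A minimal sketch, assuming hypothetical type names and a boolean capability check standing in for real driver discovery:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum GroupBy {
    VolumeGroupSnapshot,
    None,
}

#[derive(Debug, PartialEq)]
enum SnapshotPlan {
    /// One crash-consistent group snapshot across all selected PVCs.
    Group,
    /// Independent snapshots, only when explicitly requested.
    PerPvc,
}

/// Returns the plan, or the condition name to report when the driver
/// cannot honor the requested grouping. There is deliberately no arm
/// that downgrades VolumeGroupSnapshot to per-PVC snapshots.
fn plan_snapshots(group_by: GroupBy, driver_supports_vgs: bool) -> Result<SnapshotPlan, &'static str> {
    match (group_by, driver_supports_vgs) {
        (GroupBy::VolumeGroupSnapshot, true) => Ok(SnapshotPlan::Group),
        (GroupBy::VolumeGroupSnapshot, false) => Err("GroupSnapshotUnsupported"),
        (GroupBy::None, _) => Ok(SnapshotPlan::PerPvc),
    }
}
```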
4.8 Observability
| Surface | Volsync | This operator |
|---|---|---|
| Per-PVC metrics labels | absent (#518) | always present (pvc, pvc_namespace, backup_config, repository) |
| Stale-metrics-on-delete | yes (#1194) | metrics are CR-scoped, deleted with the CR |
| lastSuccessAt | only derivable | BackupSchedule.status.lastSuccessfulSchedule first-class |
| Snapshot count | exec into pod | kubectl get backup -l kopia.io/backup-config=... |
| Logs | tail of last pod | small tail (~4KB) in Backup.status.logTail; full logs in Job pod |
| Repo storage stats | absent | Repository.status.storageStats |
SLO-friendly metrics:
- kopia_operator_backup_last_success_timestamp_seconds{backup_config,namespace}: gauge.
- kopia_operator_backup_consecutive_failures{backup_config,namespace}: gauge.
- kopia_operator_restore_duration_seconds{...}: summary (p50/p90/p99).
4.9 Security & RBAC
- Operator is namespaced by default; cluster-scoped install is opt-in via Helm value.
- Mover pods run as runAsNonRoot: true, runAsUser: 65534 (nobody) by default. Files restored may not match original ownership; this is a documented trade-off, not a hidden surprise.
- mover.securityContext: {...} is the explicit override.
- mover.inheritSecurityContextFrom: { podSelector } is an opt-in best-effort copy from a live consumer; it fails loud (condition) if no pod matches at backup time. Not a default because the workload may be scaled to zero.
- mover.privilegedMode: true (namespace-gated by kopia.io/allow-privileged-movers: "true") runs with CHOWN/FOWNER for clean ownership restoration. Explicit double opt-in.
- lost+found and similar system entries are skipped by default (fork #1033/#1889).
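The double opt-in for privileged movers reduces to a conjunction of two independent switches. The sketch below assumes the namespace gate is read from a string map of namespace metadata; the text does not say whether it is a label or an annotation, and the function name is hypothetical.

```rust
use std::collections::BTreeMap;

/// Gate key quoted from the text above.
const ALLOW_PRIVILEGED: &str = "kopia.io/allow-privileged-movers";

/// Privileged movers are permitted only when BOTH the spec requests
/// privileged mode AND the namespace metadata carries the gate with the
/// exact value "true". Either switch alone is not enough.
fn privileged_allowed(privileged_mode: bool, namespace_meta: &BTreeMap<String, String>) -> bool {
    privileged_mode
        && namespace_meta.get(ALLOW_PRIVILEGED).map(String::as_str) == Some("true")
}
```

Splitting the decision between the workload owner (the spec field) and whoever controls the namespace metadata means neither party can enable privileged movers alone.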
4.10 Mover pods & failure handling
- Jobs use restartPolicy: Never and backoffLimit: spec.failurePolicy.backoffLimit (default 2).
- concurrencyPolicy: Forbid default; missed slots produce a BackupSkipped condition.
- activeDeadlineSeconds default 7200.
- Completed mover pods are reaped on the same reconcile that observes their terminal status, so there are no zombie pods (fork #8).
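The defaults and the Forbid behavior listed above could be captured as follows. The struct, enum, and function names are assumptions for illustration; only the numeric defaults and the BackupSkipped condition name come from the text.

```rust
/// Failure-handling knobs with the documented defaults.
#[derive(Debug, Clone, PartialEq)]
struct FailurePolicy {
    backoff_limit: u32,
    active_deadline_seconds: u64,
}

impl Default for FailurePolicy {
    fn default() -> Self {
        Self {
            backoff_limit: 2,
            active_deadline_seconds: 7200,
        }
    }
}

#[derive(Debug, PartialEq)]
enum SlotOutcome {
    CreateBackup,
    Skip { condition: &'static str },
}

/// Forbid semantics: a cron slot that fires while a Backup is still in
/// flight is skipped and reported, never queued behind the running one.
fn on_cron_slot(backup_in_flight: bool) -> SlotOutcome {
    if backup_in_flight {
        SlotOutcome::Skip { condition: "BackupSkipped" }
    } else {
        SlotOutcome::CreateBackup
    }
}
```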
4.11 Forward compatibility
Every credential, schedule, policy, and identity surface is a sub-object rather than a leaf field, so future fields slot in without changing the basic shape. Specifically deferred but accommodated:
- Repository.spec.encryption.rotation: password rotation flow.
- Repository.spec.access.readOnly: read-only repo mode for restore-only operators.
- Repository.spec.backend.<type>.auth.workloadIdentity: IRSA/WIF (already structurally present; deprioritized for the homelab default).
- Backup.spec.parameters: typed run-time overrides beyond just tags.
The kopia.io API group itself is v1alpha1; webhook conversions to v1beta1/v1 will be additive only.
The concern that the API server comes to depend on the operator's webhook is bounded: webhooks intercept only kopia.io/* CRDs. PVC admission, populator dispatch, and in-flight restore reconciliation never depend on the webhook being up, so the standard failurePolicy: Fail is appropriate.
5. Usage walkthroughs
5.1 Single PVC, scheduled daily
apiVersion: v1
kind: Secret
metadata: { name: nas-primary-creds, namespace: backups }
stringData:
  AWS_ACCESS_KEY_ID: ...
  AWS_SECRET_ACCESS_KEY: ...
  KOPIA_PASSWORD: choose-something-long
---
apiVersion: kopia.io/v1alpha1
kind: Repository
metadata: { name: nas-primary, namespace: backups }
spec:
  backend:
    s3:
      bucket: my-backups
      prefix: prod/
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
      auth: { secretRef: { name: nas-primary-creds } }
  encryption: { passwordSecretRef: { name: nas-primary-creds, key: KOPIA_PASSWORD } }
  create: { enabled: true }
---
apiVersion: kopia.io/v1alpha1
kind: BackupConfig
metadata: { name: postgres-data, namespace: billing }
spec:
  repository: { name: nas-primary, namespace: backups }
  sources: [{ pvc: { name: postgres-data } }]
  retention: { keepDaily: 14, keepWeekly: 4 }
---
apiVersion: kopia.io/v1alpha1
kind: BackupSchedule
metadata: { name: postgres-data-nightly, namespace: billing }
spec:
  configRef: { name: postgres-data }
  schedule:
    cron: "H 2 * * *"
    jitter: 30m
    runOnCreate: false
Maintenance is implicit — a default Maintenance is created on first reference to a Repository unless one already exists.
5.2 Restore by picking a backup
kubectl get backup -n billing \
  -l kopia.io/backup-config=postgres-data \
  --sort-by=.status.startTime

apiVersion: kopia.io/v1alpha1
kind: Restore
metadata: { name: postgres-restore-yesterday, namespace: billing }
spec:
  source:
    backupRef: { name: postgres-data-20260523-021300, namespace: billing }
  target:
    pvc:
      name: postgres-data-restored
      storageClassName: fast-ssd
      capacity: 100Gi
      accessModes: [ReadWriteOnce]
5.3 Multi-PVC selector
apiVersion: kopia.io/v1alpha1
kind: BackupConfig
metadata: { name: app-bundle, namespace: billing }
spec:
  repository: { name: nas-primary, namespace: backups }
  identity: { username: app-bundle, hostname: billing }
  sources:
    - pvcSelector:
        labelSelector: { matchLabels: { backup: include } }
        sourcePathStrategy: PVCName
  groupBy: VolumeGroupSnapshot
  retention: { keepDaily: 14 }
5.4 Deploy-or-restore (GitOps)
The headline pattern. Apply everything together; on a fresh cluster against an existing repo, the PVC restores; on a fresh repo, it comes up empty and gets backed up going forward.
apiVersion: kopia.io/v1alpha1
kind: BackupConfig
metadata: { name: postgres-data, namespace: billing }
spec:
  repository: { name: nas-primary, namespace: backups }
  sources: [{ pvc: { name: postgres-data } }]
---
apiVersion: kopia.io/v1alpha1
kind: BackupSchedule
metadata: { name: postgres-data-nightly, namespace: billing }
spec:
  configRef: { name: postgres-data }
  schedule: { cron: "H 2 * * *", jitter: 30m, runOnCreate: false }
---
apiVersion: kopia.io/v1alpha1
kind: Restore
metadata: { name: postgres-data-restore, namespace: billing }
spec:
  source: { fromConfig: { name: postgres-data, offset: 0 } }
  policy: { onMissingSnapshot: Continue }   # default for fromConfig; explicit here for clarity
  # No target: passive. The PVC below references this Restore.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: postgres-data, namespace: billing }
spec:
  storageClassName: fast-ssd
  resources: { requests: { storage: 100Gi } }
  accessModes: [ReadWriteOnce]
  dataSourceRef:
    apiGroup: kopia.io
    kind: Restore
    name: postgres-data-restore
5.5 Manual one-shot backup
apiVersion: kopia.io/v1alpha1
kind: Backup
metadata: { name: postgres-data-pre-migration, namespace: billing }
spec:
  configRef: { name: postgres-data }
  tags: { reason: "pre-schema-migration" }
Equivalently from any external system: Argo Events Sensor, Tekton Task, GitHub Actions, webhook handler. The Backup CR is the universal entry point.
5.6 Restore from a discovered (foreign / pre-install) backup
# Discovered Backups live in the Repository's namespace because the operator
# has no reliable way to attribute them to a BackupConfig.
kubectl get backup -n backups -l kopia.io/origin=discovered

apiVersion: kopia.io/v1alpha1
kind: Restore
metadata: { name: rescue-restore, namespace: billing }
spec:
  source:
    backupRef: { name: kopia-disc-9c2a1f, namespace: backups }
  target: { pvc: { name: rescue-pvc, storageClassName: fast-ssd, capacity: 50Gi, accessModes: [ReadWriteOnce] } }
5.7 Suspending a schedule via GitOps
spec:
  schedule:
    suspend: true   # apply via PR; un-suspend in a follow-up PR
In-flight Backups are unaffected; only future cron firings are skipped.
6. Consequences
6.1 Positive
- Kopia-native ergonomics: identity, policy, hooks, snapshot listing all map to user mental models 1:1 with kopia.
- Three-layer triggering: any source can fire a Backup. Solves #1559 and #627 together.
- GitOps deploy-or-restore is a single-manifest pattern, not a runbook.
- kubectl get backup is the one place to look: runs, history, and the catalog all live there.
- Multi-PVC, multi-namespace, one-repo: first-class.
- VolumeGroupSnapshot on by default; degrades loudly, never silently.
- No bash mover scripts.
6.2 Negative / trade-offs
- Five CRDs to volsync's two: a discoverability cost. Mitigated by hiding Maintenance from the typical first-time user flow (the simple case in §5.1 doesn't reference it explicitly) and by overloading Backup to cover both the "operator made this" and "we found this in the repo" cases.
- origin: discovered Backup CRs add etcd load. Mitigated by Repository.spec.catalog.retain bounds.
- Webhook resolution of Restore.source.backupRef means restoring a snapshot just outside the catalog window requires the raw-identity escape hatch. Documented.
- Kopia version pinning is a single choice baked into the operator image; mitigated by a Repository.spec.kopiaImageOverride escape hatch for advanced users.
- Coexists with volsync rather than supplanting it; users wanting rsync/syncthing keep volsync.
7. References
- VolSync upstream: https://github.com/backube/volsync
- VolSync kopia fork (perfectra1n): https://github.com/perfectra1n/volsync
- Kopia mover PR: https://github.com/backube/volsync/pull/1723
- Trigger redesign proposal: https://github.com/backube/volsync/issues/1559
- Kopia: https://kopia.io/
- CloudNativePG (Backup/ScheduledBackup pattern): https://cloudnative-pg.io/documentation/current/backup/
- KEP-1495 AnyVolumeDataSource: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1495-volume-populators
- kubernetes-csi/volume-data-source-validator: https://github.com/kubernetes-csi/volume-data-source-validator
- kubernetes-csi/lib-volume-populator: https://github.com/kubernetes-csi/lib-volume-populator
Appendix A: Field-by-field comparison vs volsync
| Concern | volsync | this operator |
|---|---|---|
| Repo as a resource | Secret reference | Repository CRD |
| Triggering layers | one (trigger field on source) | three (BackupConfig / Backup / BackupSchedule) |
| Manual / external trigger | trigger.manual: <value> string-change | kubectl create backup (G17, #1559) |
| Multi-PVC | not supported | pvcSelector + groupBy: VolumeGroupSnapshot default |
| Multi-PVC consistency on unsupported drivers | n/a | fail loud (GroupSnapshotUnsupported); groupBy: None opts into per-PVC |
| First-sync skip | not supported (#627) | runOnCreate: false default |
| Cron jitter | not supported (#1421) | jitter + H substitution |
| Cron timezone | not supported (#702) | schedule.timezone |
| Schedule anchor | last-completion | wall-clock (CronJob-style) |
| Concurrency policy | implicit | concurrencyPolicy: Forbid default |
| BackoffLimit | hard-coded 8 | per-Backup failurePolicy.backoffLimit |
| Snapshot as a K8s object | absent | Backup CR (operator-initiated, manual, or discovered) |
| Restore by snapshot reference | no — only restoreAsOf | source.backupRef |
| Restore on a fresh cluster against an existing repo | manual runbook | source.fromConfig resolves via identity |
| Missing snapshot at restore | silently succeeds with empty PVC | Fail default; Continue for fromConfig (deploy-or-restore) |
| Maintenance | embedded in mover pod | own Maintenance CRD |
| Maintenance ownership | implicit | explicit lease + status |
| Snapshot catalog | exec into pod | Backup CRs (origin: discovered), bounded materialization |
| Hooks | shell-string in mover | typed workloadExec / runJob / httpRequest |
| Policy passthrough | restic flags only | typed policy.* + extraArgs escape hatch |
| Per-PVC metrics | absent (#518) | first-class labels |
| Stale metrics | observed (#1194) | metric-per-CR, cleared on delete |
| Mover image default tag | :latest | digest-pinned per operator release |
| Zombie mover pods | observed (fork #8) | reaped on terminal-status reconcile |
| Lost+found root files | break unprivileged restore (#1033) | skipped by default |
| Restore target modes | destinationPVC (in-place) OR populator | target.pvcRef / target.pvc / passive (populator) — uniform |
| Populator dependency | requires reading the docs to know | clarified: works on any 1.24+ cluster; volume-data-source-validator is an optional UX nicety |