Retention Operations Runbook

Date: 2026-03-04

Purpose

Define how to operate diff-retention cleanup safely in production.

Controls

  • DIFFVER_RETENTION_CLEANUP_EVERY_SECONDS
      • 0 disables background cleanup in the worker loop.
      • Recommended start: 300 (5 minutes), then tune.
  • DIFFVER_RETENTION_FULL_MODE_DAYS
      • Maximum age, in days, of diffs kept under the full retention policy before they are purged.
      • Recommended baseline: 30.
  • DIFFVER_ENABLE_ADMIN_ENDPOINTS=1
      • Enables the manual run endpoint:
        POST /v1/admin/retention/run-once?full_mode_days=30
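
As a rough illustration of how these controls fit together, the sketch below reads the three variables from the environment. The `RetentionControls` class, the `load_retention_controls` helper, and the fallback defaults (0, 30, and disabled) are assumptions made for this example, not the service's actual configuration loader.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionControls:
    """Hypothetical container for the three retention controls."""
    cleanup_every_seconds: int      # 0 disables background cleanup
    full_mode_days: int             # max age of full-retention diffs
    admin_endpoints_enabled: bool   # gates the manual run endpoint

    @property
    def background_cleanup_enabled(self) -> bool:
        return self.cleanup_every_seconds > 0


def load_retention_controls(env=None) -> RetentionControls:
    """Read the controls from a mapping (defaults to os.environ).

    The fallback values used when a variable is unset are assumptions.
    """
    env = os.environ if env is None else env
    return RetentionControls(
        cleanup_every_seconds=int(
            env.get("DIFFVER_RETENTION_CLEANUP_EVERY_SECONDS", "0")
        ),
        full_mode_days=int(env.get("DIFFVER_RETENTION_FULL_MODE_DAYS", "30")),
        admin_endpoints_enabled=env.get("DIFFVER_ENABLE_ADMIN_ENDPOINTS", "0") == "1",
    )
```

A pre-deployment check could call `load_retention_controls()` and refuse to roll out a worker whose `background_cleanup_enabled` is false.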

Normal Operation

  1. Keep background cleanup enabled in worker deployment.
  2. Track retention.cleanup_run audit events for each execution.
  3. Monitor GET /v1/admin/metrics:
       • retention_runs
       • retention_failures
       • retention_last_purged_diffs
       • retention_last_duration_ms
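
A monitoring job might pull out the four retention metrics along the lines of the sketch below. It assumes the metrics endpoint returns a flat JSON object keyed by metric name, which this runbook does not confirm; the helper names and `base_url` are hypothetical, and any authentication the admin endpoints require is left out.

```python
import json
import urllib.request

RETENTION_METRIC_KEYS = (
    "retention_runs",
    "retention_failures",
    "retention_last_purged_diffs",
    "retention_last_duration_ms",
)


def extract_retention_metrics(payload: str) -> dict:
    """Pick the retention metrics out of a JSON payload.

    Assumes a flat JSON object; metrics that are absent map to None.
    """
    data = json.loads(payload)
    return {key: data.get(key) for key in RETENTION_METRIC_KEYS}


def fetch_retention_metrics(base_url: str, timeout: float = 10.0) -> dict:
    """GET /v1/admin/metrics and return only the retention metrics."""
    url = f"{base_url.rstrip('/')}/v1/admin/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_retention_metrics(resp.read().decode("utf-8"))
```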

Alert Thresholds

  • Critical:
      • retention_failures increases in 2+ consecutive intervals.
      • No retention.cleanup_run event for > 2x expected interval.
  • Warning:
      • retention_last_duration_ms > 30,000 ms (sustained).
      • retention_last_purged_diffs remains 0 unexpectedly when backlog exists.
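
The thresholds above can be expressed as a small evaluation function. This is a sketch under stated assumptions: the function names and inputs are hypothetical, "sustained" is modeled as every duration sample in the caller's window exceeding 30,000 ms, and whether a backlog exists is something the caller has to determine separately.

```python
from typing import List, Sequence


def failures_increasing(failure_counts: Sequence[int], consecutive: int = 2) -> bool:
    """True if the counter rose in each of the last `consecutive` intervals."""
    if len(failure_counts) < consecutive + 1:
        return False
    recent = failure_counts[-(consecutive + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def evaluate_alerts(
    failure_counts: Sequence[int],
    seconds_since_last_run: float,
    expected_interval_seconds: float,
    duration_samples_ms: Sequence[float],
    last_purged_diffs: int,
    backlog_exists: bool,
) -> List[str]:
    """Return alert messages for the thresholds listed in this runbook."""
    alerts = []
    if failures_increasing(failure_counts):
        alerts.append("critical: retention_failures increased in 2+ consecutive intervals")
    if seconds_since_last_run > 2 * expected_interval_seconds:
        alerts.append("critical: no retention.cleanup_run event for > 2x expected interval")
    # Assumption: "sustained" means every sample in the window is over the limit.
    if duration_samples_ms and all(d > 30_000 for d in duration_samples_ms):
        alerts.append("warning: retention_last_duration_ms > 30,000 ms (sustained)")
    if backlog_exists and last_purged_diffs == 0:
        alerts.append("warning: retention_last_purged_diffs is 0 while backlog exists")
    return alerts
```

With a 300-second cleanup interval, for example, the second critical condition fires once more than 600 seconds pass without a retention.cleanup_run event.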

Failure Recovery

  1. Confirm worker process is alive and DIFFVER_RETENTION_CLEANUP_EVERY_SECONDS > 0.
  2. Run manual cleanup once:
       • POST /v1/admin/retention/run-once?full_mode_days=<value>
  3. Check audit trail:
       • retention.cleanup_run
       • retention.cleanup_failed
  4. Validate storage permissions for DIFFVER_DIFFS_DIR.
  5. If persistent failures occur:
       • Temporarily disable automated cleanup (...EVERY_SECONDS=0),
       • Run manual cleanup during maintenance window,
       • Fix storage/runtime issue and re-enable scheduler.
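
Two of the recovery steps, the manual run and the storage permission check, lend themselves to a short script. The sketch below is illustrative only: `build_run_once_request` and `diffs_dir_is_accessible` are hypothetical helpers, `base_url` is whatever address the admin API is served on, and any authentication the admin endpoints require is omitted. Note that `os.access` reports permissions for the user running the script, so it should run as the same user as the worker.

```python
import os
import urllib.parse
import urllib.request


def build_run_once_request(base_url: str, full_mode_days: int) -> urllib.request.Request:
    """Build (but do not send) the manual cleanup request."""
    query = urllib.parse.urlencode({"full_mode_days": full_mode_days})
    url = f"{base_url.rstrip('/')}/v1/admin/retention/run-once?{query}"
    return urllib.request.Request(url, method="POST")


def diffs_dir_is_accessible(path: str) -> bool:
    """Check that the diffs directory exists and can be listed and modified.

    Deleting files requires write and execute permission on the directory.
    """
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK | os.X_OK)
```

To send the request, pass it to `urllib.request.urlopen(req, timeout=30)`; the directory check would typically be given the value of DIFFVER_DIFFS_DIR.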

Verification Checklist

  • [ ] Artifact retrieval still works after diff purge.
  • [ ] Policy mode artifact_only purges diff immediately after scan finalization.
  • [ ] Policy mode full purges only after threshold age.
  • [ ] Audit events are emitted for both success and failure paths.
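
The two policy-mode checks in this list can be summarized as a decision rule. The function below is a sketch for reasoning about expected behavior, not the service's purge logic: its name and signature are hypothetical, and it assumes a diff's age is compared strictly against the full-mode threshold.

```python
from datetime import timedelta


def should_purge_diff(
    policy_mode: str,
    scan_finalized: bool,
    diff_age: timedelta,
    full_mode_days: int,
) -> bool:
    """Decide whether a diff is eligible for purge under the given policy."""
    if policy_mode == "artifact_only":
        # Purged immediately once the scan is finalized, regardless of age.
        return scan_finalized
    if policy_mode == "full":
        # Purged only after the threshold age (strict comparison is an assumption).
        return diff_age > timedelta(days=full_mode_days)
    raise ValueError(f"unknown policy mode: {policy_mode!r}")
```

A verification test could assert, for instance, that a 29-day-old diff under the full policy with a 30-day threshold is retained while a 31-day-old one is eligible for purge.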