Retention Operations Runbook
Date: 2026-03-04
Purpose
Define how to operate diff-retention cleanup safely in production.
Controls
- `DIFFVER_RETENTION_CLEANUP_EVERY_SECONDS`
  - `0` disables background cleanup in worker loop.
  - Recommended start: `300` (5 minutes), then tune.
- `DIFFVER_RETENTION_FULL_MODE_DAYS`
  - Max age for `full` retention diffs before purge.
  - Recommended baseline: `30`.
- `DIFFVER_ENABLE_ADMIN_ENDPOINTS=1`
  - Enables manual run endpoint: `POST /v1/admin/retention/run-once?full_mode_days=30`
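As a minimal sketch of how these controls could be read and validated, assuming they arrive as environment variables: the names `RetentionConfig`, `load_retention_config`, and `background_cleanup_enabled` are hypothetical, and the fallback values simply reuse the recommended values above rather than any confirmed service defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RetentionConfig:
    cleanup_every_seconds: int     # 0 disables the background cleanup loop
    full_mode_days: int            # max age of `full` retention diffs before purge
    admin_endpoints_enabled: bool  # gates the manual run-once endpoint


def load_retention_config(env: Optional[Mapping[str, str]] = None) -> RetentionConfig:
    """Read the three controls from an environment-like mapping.

    Fallbacks (300, 30, disabled) are assumptions of this sketch.
    """
    env = os.environ if env is None else env
    every = int(env.get("DIFFVER_RETENTION_CLEANUP_EVERY_SECONDS", "300"))
    days = int(env.get("DIFFVER_RETENTION_FULL_MODE_DAYS", "30"))
    if every < 0 or days < 0:
        raise ValueError("retention settings must be non-negative")
    admin = env.get("DIFFVER_ENABLE_ADMIN_ENDPOINTS", "0") == "1"
    return RetentionConfig(every, days, admin)


def background_cleanup_enabled(cfg: RetentionConfig) -> bool:
    # Per the runbook, an interval of 0 turns the scheduler off.
    return cfg.cleanup_every_seconds > 0
```

Passing a plain dictionary instead of the real environment makes it easy to check a proposed configuration before deploying it.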
Normal Operation
- Keep background cleanup enabled in worker deployment.
- Track `retention.cleanup_run` audit events for each execution.
- Monitor `GET /v1/admin/metrics`:
  - `retention_runs`
  - `retention_failures`
  - `retention_last_purged_diffs`
  - `retention_last_duration_ms`
Alert Thresholds
- Critical:
  - `retention_failures` increases in 2+ consecutive intervals.
  - No `retention.cleanup_run` event for > 2x expected interval.
- Warning:
  - `retention_last_duration_ms` > 30,000 ms (sustained).
  - `retention_last_purged_diffs` remains `0` unexpectedly when backlog exists.
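The thresholds above could be evaluated along the following lines. This is a sketch under stated assumptions, not the service's alerting code: it treats `retention_failures` as a cumulative counter sampled once per interval, reads "sustained" as three consecutive samples, and assumes the backlog size is known from some other source. All function names are hypothetical.

```python
from typing import Sequence


def failures_critical(failure_counts: Sequence[int], consecutive: int = 2) -> bool:
    """True if the counter rose in each of the last `consecutive` intervals."""
    if len(failure_counts) < consecutive + 1:
        return False  # not enough samples to compare yet
    recent = failure_counts[-(consecutive + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def run_event_missing(seconds_since_last_run: float, interval_seconds: int) -> bool:
    """True if no cleanup_run event was seen for more than 2x the interval."""
    # An interval of 0 means cleanup is disabled, so silence is expected.
    return interval_seconds > 0 and seconds_since_last_run > 2 * interval_seconds


def duration_warning(durations_ms: Sequence[float],
                     limit_ms: float = 30_000,
                     sustained: int = 3) -> bool:
    """True if the last `sustained` samples all exceed the limit (assumed meaning)."""
    recent = durations_ms[-sustained:]
    return len(recent) == sustained and all(d > limit_ms for d in recent)


def purge_stalled(last_purged_diffs: int, backlog: int) -> bool:
    """True if nothing was purged although eligible diffs are waiting."""
    return last_purged_diffs == 0 and backlog > 0
```

Keeping each rule as a separate pure function makes it straightforward to tune one threshold, such as the sample count behind "sustained", without touching the others.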
Failure Recovery
- Confirm worker process is alive and `DIFFVER_RETENTION_CLEANUP_EVERY_SECONDS > 0`.
- Run manual cleanup once: `POST /v1/admin/retention/run-once?full_mode_days=<value>`
- Check audit trail:
  - `retention.cleanup_run`
  - `retention.cleanup_failed`
- Validate storage permissions for `DIFFVER_DIFFS_DIR`.
- If persistent failures occur:
  - Temporarily disable automated cleanup (`...EVERY_SECONDS=0`),
  - Run manual cleanup during maintenance window,
  - Fix storage/runtime issue and re-enable scheduler.
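The manual cleanup call in the steps above might be scripted as follows. This sketch only builds the request; the base URL is a placeholder, the helper name `build_run_once_request` is hypothetical, and any authentication the admin endpoints require is not covered by this runbook and is therefore omitted.

```python
import urllib.parse
import urllib.request


def build_run_once_request(base_url: str, full_mode_days: int) -> urllib.request.Request:
    """Build the POST request for a single manual retention run."""
    if full_mode_days < 0:
        raise ValueError("full_mode_days must be non-negative")
    query = urllib.parse.urlencode({"full_mode_days": full_mode_days})
    url = f"{base_url.rstrip('/')}/v1/admin/retention/run-once?{query}"
    # The endpoint takes its parameter in the query string, so no body is sent.
    return urllib.request.Request(url, method="POST")


# Example (placeholder host; add whatever auth headers the deployment needs):
# req = build_run_once_request("http://localhost:8000", 30)
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.status)
```

The endpoint is only available when `DIFFVER_ENABLE_ADMIN_ENDPOINTS=1` is set, so a failed request is worth checking against that setting before suspecting the cleanup itself.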
Verification Checklist
- [ ] Artifact retrieval still works after diff purge.
- [ ] Policy mode `artifact_only` purges diff immediately after scan finalization.
- [ ] Policy mode `full` purges only after threshold age.
- [ ] Audit events are emitted for both success and failure paths.
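The two policy-mode checks can be stated as a small decision function, which may help when writing test cases for this checklist. It is a sketch only: `should_purge_diff` is a hypothetical name, and measuring age from a `created_at` timestamp with a strict "older than" comparison is an assumption, not confirmed behavior.

```python
from datetime import datetime, timedelta, timezone


def should_purge_diff(policy_mode: str,
                      scan_finalized: bool,
                      created_at: datetime,
                      now: datetime,
                      full_mode_days: int) -> bool:
    """Decide whether a diff is eligible for purge under the given policy mode."""
    if policy_mode == "artifact_only":
        # Purged as soon as the scan is finalized, regardless of age.
        return scan_finalized
    if policy_mode == "full":
        # Purged only once older than the threshold (strict comparison assumed).
        return now - created_at > timedelta(days=full_mode_days)
    raise ValueError(f"unknown policy mode: {policy_mode!r}")


now = datetime(2026, 3, 4, tzinfo=timezone.utc)
fresh = now - timedelta(days=1)
stale = now - timedelta(days=31)
```

With these sample timestamps, an `artifact_only` diff is eligible once finalized even when fresh, while a `full` diff with a 30-day threshold becomes eligible only in the `stale` case.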