Product Guide
AI Trust Registry Operations Runbook
SLOs, alert rules, incident handling, and operational checks for reliable registry scoring.
Last updated Mar 4, 2026
Operational SLOs
- P95 scoring latency: 800ms or less.
- Webhook delivery success: 99.9% over 24h.
- Score freshness: 95% of active entities updated within the policy interval.
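The three targets above can be checked mechanically. The following is a minimal sketch, assuming raw latency samples and simple counters are available. The names `p95`, `ratio`, `SloReport`, and `evaluate_slos` are hypothetical and not part of the registry, and the nearest-rank percentile method is one common choice, not necessarily the one the registry uses.

```python
from dataclasses import dataclass
from typing import Sequence

# Targets copied from the SLO list above.
P95_LATENCY_TARGET_MS = 800.0
WEBHOOK_SUCCESS_TARGET = 0.999
FRESHNESS_TARGET = 0.95


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def ratio(numerator: int, denominator: int) -> float:
    """Fraction of successes; an empty population counts as fully met (assumption)."""
    return 1.0 if denominator == 0 else numerator / denominator


@dataclass(frozen=True)
class SloReport:
    latency_ok: bool
    webhook_ok: bool
    freshness_ok: bool


def evaluate_slos(latencies_ms: Sequence[float],
                  webhooks_delivered: int, webhooks_attempted: int,
                  fresh_entities: int, active_entities: int) -> SloReport:
    """Compare one observation window against the three targets."""
    return SloReport(
        latency_ok=p95(latencies_ms) <= P95_LATENCY_TARGET_MS,
        webhook_ok=ratio(webhooks_delivered, webhooks_attempted) >= WEBHOOK_SUCCESS_TARGET,
        freshness_ok=ratio(fresh_entities, active_entities) >= FRESHNESS_TARGET,
    )
```

For the webhook target, the counters would cover the trailing 24 hours. The latency and freshness windows are not specified above, so the caller has to choose them.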
Alerting Rules
- score_pipeline_error_rate > 2% for 5 minutes.
- confidence_floor_breach_count > threshold per hour.
- webhook_retry_queue_depth > 500.
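As an illustration of how these rules might be evaluated, here is a hedged sketch. The metric names come from the rules above. The `Sample` type, the helper functions, and the `breach_threshold` parameter are assumptions, and the per-hour threshold value is not given in this runbook. The "for 5 minutes" condition is interpreted as: every sample in the last five minutes exceeds the limit, and the history reaches back at least that far.

```python
from dataclasses import dataclass
from typing import List, Sequence

ERROR_RATE_LIMIT = 0.02        # 2%, from the first rule
ERROR_RATE_WINDOW_S = 5 * 60   # "for 5 minutes"
RETRY_QUEUE_LIMIT = 500        # from the third rule


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds, any monotonic or epoch clock
    value: float


def sustained_above(samples: Sequence[Sample], limit: float,
                    window_s: float, now: float) -> bool:
    """True if the metric stayed above `limit` for the whole window ending at `now`."""
    recent = [s for s in samples if now - window_s <= s.timestamp <= now]
    # Without a sample at or before the window start, the duration is unproven.
    has_full_window = any(s.timestamp <= now - window_s for s in samples)
    return bool(recent) and has_full_window and all(s.value > limit for s in recent)


def firing_alerts(error_rate_samples: Sequence[Sample],
                  breach_count_last_hour: int,
                  breach_threshold: int,
                  retry_queue_depth: int,
                  now: float) -> List[str]:
    """Return the names of the rules that currently fire."""
    firing = []
    if sustained_above(error_rate_samples, ERROR_RATE_LIMIT, ERROR_RATE_WINDOW_S, now):
        firing.append("score_pipeline_error_rate")
    if breach_count_last_hour > breach_threshold:
        firing.append("confidence_floor_breach_count")
    if retry_queue_depth > RETRY_QUEUE_LIMIT:
        firing.append("webhook_retry_queue_depth")
    return firing
```

Requiring a full window of data keeps a single spike, or a freshly restarted collector, from paging anyone.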
Incident Playbook
Scoring Degradation Incident
1. Detect alert.
2. Identify failing framework connector.
3. Switch to degraded mode.
4. Keep last-known score and mark confidence_reduced.
5. Notify ops and affected tenants.
6. Restore connector.
7. Backfill missed evaluations.
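The degraded-mode steps in this playbook (keep the last-known score, mark it confidence_reduced, and remember what to backfill) can be sketched as follows. This is an illustrative assumption about the data shape, not the registry's actual model: `ScoreRecord`, `score_with_fallback`, and the `connector` callable are hypothetical names. Only the `confidence_reduced` flag comes from the playbook.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ScoreRecord:
    entity_id: str
    score: float
    confidence_reduced: bool = False


def score_with_fallback(entity_id: str,
                        last_known: Optional[ScoreRecord],
                        connector: Callable[[str], float]) -> Tuple[ScoreRecord, bool]:
    """Return (record, needs_backfill) for one entity.

    If the connector fails, serve the last-known score flagged as
    confidence_reduced and report that the evaluation must be backfilled.
    """
    try:
        fresh_score = connector(entity_id)
    except Exception:
        if last_known is None:
            raise  # nothing to fall back to; surface the failure
        return replace(last_known, confidence_reduced=True), True
    # A successful evaluation clears the reduced-confidence flag.
    return ScoreRecord(entity_id, fresh_score), False
```

A caller would collect the entity ids for which `needs_backfill` is true and replay them once the connector is restored.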
Webhook Delivery Incident
- Pause non-critical event dispatch.
- Validate signature key configuration.
- Drain retry queue with controlled backoff.
- Reconcile event ids for idempotency gaps.
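Two of the steps above lend themselves to a short sketch: draining with controlled backoff, and reconciling event ids. The code below assumes capped exponential backoff and a simple comparison of dispatched against acknowledged ids. The function names, the base delay, and the cap are hypothetical, and the runbook does not specify the actual backoff policy.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 60.0) -> List[float]:
    """Capped exponential delays: base, 2*base, 4*base, ... never above cap_s."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]


def reconcile_event_ids(dispatched: Iterable[str],
                        acknowledged: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, duplicated) event ids.

    missing:    dispatched but never acknowledged (candidates for redelivery)
    duplicated: acknowledged more than once (idempotency was not enforced)
    """
    ack_counts = Counter(acknowledged)
    missing = sorted({e for e in dispatched if ack_counts[e] == 0})
    duplicated = sorted(e for e, n in ack_counts.items() if n > 1)
    return missing, duplicated
```

In practice, jitter is often added to each delay so that retries from many tenants do not synchronize. It is left out here to keep the output deterministic.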
Daily Operator Checklist
- Review top entities by score delta.
- Review low-confidence queue trend.
- Confirm no backlog in retry/dead-letter queues.
- Validate policy threshold changes are audited.
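The first checklist item, reviewing top entities by score delta, could be supported by a small helper like the one below. It is a sketch under the assumption that two score snapshots are available as plain mappings from entity id to score. `top_score_deltas` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def top_score_deltas(previous: Dict[str, float],
                     current: Dict[str, float],
                     limit: int = 10) -> List[Tuple[str, float]]:
    """Entities present in both snapshots, ordered by largest absolute change."""
    deltas = {eid: current[eid] - previous[eid]
              for eid in current.keys() & previous.keys()}
    # Sort by magnitude descending, then by entity id for a stable order.
    ranked = sorted(deltas.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:limit]
```

Entities that appear in only one snapshot are skipped here. An operator may want to review new or removed entities separately.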