Product Guide

AI Trust Registry Operations Runbook

SLOs, alert rules, incident handling, and operational checks for reliable registry scoring.

Last updated Mar 4, 2026

Operational SLOs

  • P95 scoring latency: 800 ms or less.
  • Webhook delivery success: at least 99.9% over 24h.
  • Score freshness: at least 95% of active entities updated within their policy interval.
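
The three targets above can be checked mechanically. The following is a minimal sketch, not the registry's actual implementation: the function names, the input shapes, and the choice of a nearest-rank percentile are all assumptions made for illustration.

```python
# Targets copied from the SLO list above.
P95_LATENCY_MS = 800.0
WEBHOOK_SUCCESS_TARGET = 0.999
FRESHNESS_TARGET = 0.95


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_slos(latencies_ms, webhooks_ok, webhooks_total,
               fresh_entities, active_entities):
    """Return a dict mapping each SLO name to True (met) or False (missed).

    All inputs are hypothetical aggregates that an operator would pull
    from their own metrics store for the evaluation window.
    """
    return {
        "p95_latency": p95(latencies_ms) <= P95_LATENCY_MS,
        "webhook_success": webhooks_ok / webhooks_total >= WEBHOOK_SUCCESS_TARGET,
        "score_freshness": fresh_entities / active_entities >= FRESHNESS_TARGET,
    }
```

Returning one boolean per SLO, rather than a single pass/fail, keeps it obvious which target was missed when the result is surfaced on a dashboard.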

Alerting Rules

  1. score_pipeline_error_rate > 2% for 5 minutes.
  2. confidence_floor_breach_count > configured threshold per hour.
  3. webhook_retry_queue_depth > 500.
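
As an illustration of how these rules might be evaluated, here is a hedged sketch. It assumes one error-rate sample per minute and reads "for 5 minutes" as "every sample in the last five minutes exceeds 2%"; the `MetricsWindow` type and the `breach_threshold` parameter are hypothetical, since the runbook does not state the breach threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricsWindow:
    # Hypothetical snapshot; field names mirror the metric names in the rules.
    error_rates_last_5m: List[float]   # one score_pipeline_error_rate sample per minute
    confidence_floor_breaches_last_hour: int
    webhook_retry_queue_depth: int


def firing_alerts(m: MetricsWindow, breach_threshold: int) -> List[str]:
    """Return the names of the alert rules that currently fire."""
    alerts = []
    # Rule 1: error rate above 2% for the whole 5-minute window.
    window = m.error_rates_last_5m
    if len(window) >= 5 and all(rate > 0.02 for rate in window):
        alerts.append("score_pipeline_error_rate")
    # Rule 2: breach count above the operator-configured hourly threshold.
    if m.confidence_floor_breaches_last_hour > breach_threshold:
        alerts.append("confidence_floor_breach_count")
    # Rule 3: retry queue deeper than 500 events.
    if m.webhook_retry_queue_depth > 500:
        alerts.append("webhook_retry_queue_depth")
    return alerts
```

Requiring every sample in the window to breach means a single healthy minute resets rule 1, which is one common reading of a "for N minutes" condition.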

Incident Playbook

Scoring Degradation Incident

  1. Detect alert.
  2. Identify failing framework connector.
  3. Switch to degraded mode.
  4. Keep last-known score and mark confidence_reduced.
  5. Notify ops and affected tenants.
  6. Restore connector.
  7. Backfill missed evaluations.
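
The degraded-mode behavior in this playbook (keep the last-known score, mark it confidence_reduced, and queue the entity for backfill) can be sketched as follows. This is an assumption-laden illustration: `ScoreRecord`, `score_entity`, the `evaluate` callable, and the use of `ConnectionError` to signal a failing connector are all hypothetical names, not part of the registry's API.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class ScoreRecord:
    entity_id: str
    score: float
    confidence_reduced: bool = False


def score_entity(entity_id: str,
                 last_known: ScoreRecord,
                 evaluate: Callable[[str], float],
                 backfill_queue: List[str]) -> ScoreRecord:
    """Score an entity, falling back to degraded mode if the connector fails."""
    try:
        # Normal path: fresh evaluation through the framework connector.
        return ScoreRecord(entity_id, evaluate(entity_id))
    except ConnectionError:
        # Degraded mode: remember the entity so it can be backfilled once
        # the connector is restored, and serve the last-known score flagged
        # as reduced confidence.
        backfill_queue.append(entity_id)
        return replace(last_known, confidence_reduced=True)
```

After the connector is restored, re-running `score_entity` for each id in `backfill_queue` would cover the backfill step.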

Webhook Delivery Incident

  1. Pause non-critical event dispatch.
  2. Validate signature key configuration.
  3. Drain retry queue with controlled backoff.
  4. Reconcile event ids for idempotency gaps.
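
Steps 3 and 4 can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: events are dicts with an `id` key, `send` is a hypothetical callable that returns True on successful delivery, and the backoff parameters (1 s base, doubling, 60 s cap, 5 attempts) are placeholders rather than documented defaults.

```python
import time
from collections import Counter


def backoff_delays(base_s=1.0, factor=2.0, cap_s=60.0, attempts=5):
    """Capped exponential backoff schedule, in seconds."""
    return [min(cap_s, base_s * factor ** i) for i in range(attempts)]


def drain_retry_queue(queue, send, delivered_ids, sleep=time.sleep):
    """Retry each queued event; return the events that never succeeded.

    Events whose id is already in delivered_ids are skipped, so draining
    the queue does not deliver the same event twice.
    """
    failed = []
    for event in queue:
        if event["id"] in delivered_ids:
            continue
        for delay in backoff_delays():
            if send(event):
                delivered_ids.add(event["id"])
                break
            sleep(delay)  # wait before the next attempt
        else:
            failed.append(event)  # candidate for the dead-letter queue
    return failed


def idempotency_gaps(dispatched_ids, acknowledged_ids):
    """Compare dispatched and acknowledged event ids.

    Returns (missing, duplicates): ids never acknowledged, and ids
    acknowledged more than once.
    """
    counts = Counter(acknowledged_ids)
    missing = sorted(set(dispatched_ids) - set(counts))
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicates
```

Passing `sleep` as a parameter makes the drain loop easy to exercise in tests without real waiting.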

Daily Operator Checklist

  • Review top entities by score delta.
  • Review low-confidence queue trend.
  • Confirm no backlog in retry/dead-letter queues.
  • Validate policy threshold changes are audited.
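
For the first checklist item, a small helper can rank entities by how much their score moved. This is a hypothetical sketch: it assumes scores are available as two `{entity_id: score}` mappings (for example, yesterday and today) and that "top" means largest absolute change; neither detail is specified in this runbook.

```python
def top_score_deltas(previous, current, n=10):
    """Return up to n (entity_id, delta) pairs with the largest absolute change.

    Only entities present in both snapshots are compared.
    """
    deltas = [
        (entity_id, current[entity_id] - previous[entity_id])
        for entity_id in current
        if entity_id in previous
    ]
    # Largest movement first, regardless of direction.
    deltas.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return deltas[:n]
```

Sorting on the absolute delta surfaces sharp drops and sharp rises together, since either direction may warrant review.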