Runbook: Coffee Grind API — Regional Failover

Service Overview

Service: Coffee Grind API

Owner: Platform Team

Tier: Tier 1

Regions: us-east-1 (primary), us-west-2 (failover)

Dashboard: Grafana Dashboard

Symptoms

Elevated p99 latency (>500ms) on /grind endpoint
5xx error rate exceeds 1% from a single region
Health check failures from one region while other remains healthy
Grafana alert: CoffeeGrindAPI_RegionalDegradation

Diagnosis Steps

Check regional health: Verify which region is reporting errors. Check the Grafana dashboard for per-region breakdown.
Check upstream dependencies: The Coffee Grind API depends on the Bean Inventory Service. Verify Bean Inventory is healthy in the affected region.
Check infrastructure: Review ECS task status, ALB target group health, and RDS connection pool metrics.
Check recent deployments: Review the last 2 hours of deployments in the affected region.
Check auth/credentials: Verify the service can authenticate to downstream dependencies (DynamoDB, S3). Look for 403 errors in application logs.

⚠️ Important: If errors persist after failover, the issue may not be regional. Check for global configuration changes (IAM policies, security groups, credential rotation) before escalating.

Service Overview

Symptoms

Diagnosis Steps

Mitigation: Regional Failover

Pre-flight Checks