Service Overview
Service: Coffee Grind API
Owner: Platform Team
Tier: Tier 1
Regions: us-east-1 (primary), us-west-2 (failover)
Dashboard: Grafana Dashboard
Symptoms
- Elevated p99 latency (>500ms) on the /grind endpoint
- 5xx error rate above 1% in a single region
- Health check failures from one region while the other remains healthy
- Grafana alert: CoffeeGrindAPI_RegionalDegradation
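The symptom thresholds above can be read as a simple per-region check. The following is a minimal sketch, not the actual alert rule: the `RegionMetrics` type, its field names, and the `classify` helper are hypothetical, and the metric values would have to come from whatever the Grafana dashboard exposes.

```python
from dataclasses import dataclass

# Thresholds taken from the Symptoms list in this runbook.
P99_LATENCY_THRESHOLD_MS = 500.0
ERROR_RATE_THRESHOLD = 0.01  # 1% of requests returning 5xx


@dataclass
class RegionMetrics:
    """Hypothetical snapshot of one region's metrics (field names are assumptions)."""
    region: str
    p99_latency_ms: float
    error_rate_5xx: float  # fraction between 0.0 and 1.0
    health_check_ok: bool


def is_degraded(m: RegionMetrics) -> bool:
    """A region counts as degraded if any single symptom is present."""
    return (
        m.p99_latency_ms > P99_LATENCY_THRESHOLD_MS
        or m.error_rate_5xx > ERROR_RATE_THRESHOLD
        or not m.health_check_ok
    )


def classify(metrics: list[RegionMetrics]) -> tuple[str, list[str]]:
    """Return ("healthy" | "regional" | "global", degraded region names)."""
    degraded = [m.region for m in metrics if is_degraded(m)]
    if not degraded:
        return "healthy", degraded
    if len(degraded) < len(metrics):
        return "regional", degraded
    return "global", degraded
```

A result of "regional" matches the scenario this runbook covers. A result of "global" suggests the problem is not confined to one region, in which case regional failover is unlikely to help.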
Diagnosis Steps
- Check regional health: Identify which region is reporting errors using the per-region breakdown on the Grafana dashboard.
- Check upstream dependencies: The Coffee Grind API depends on the Bean Inventory Service. Verify Bean Inventory is healthy in the affected region.
- Check infrastructure: Review ECS task status, ALB target group health, and RDS connection pool metrics.
- Check recent deployments: Review the last 2 hours of deployments in the affected region.
- Check auth/credentials: Verify the service can authenticate to downstream dependencies (DynamoDB, S3). Look for 403 errors in application logs.
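For the recent-deployments step, a small filter can narrow a deployment history to the 2-hour window in the affected region. This is a sketch under assumptions: the record shape (`region`, `deployed_at`, `version` keys) is hypothetical, and the records would need to be exported from whichever deployment tool the team uses.

```python
from datetime import datetime, timedelta, timezone

# Lookback window from the diagnosis step above.
LOOKBACK = timedelta(hours=2)


def recent_deployments(deployments, region, now=None):
    """Return deployments to `region` within the lookback window, newest first.

    `deployments` is an iterable of dicts with hypothetical keys:
    "region" (str), "deployed_at" (timezone-aware datetime), "version" (str).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - LOOKBACK
    matches = [
        d for d in deployments
        if d["region"] == region and cutoff <= d["deployed_at"] <= now
    ]
    return sorted(matches, key=lambda d: d["deployed_at"], reverse=True)
```

A deployment that landed shortly before the first alert is a natural first suspect. An empty result points the investigation back toward dependencies, infrastructure, or credentials.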
⚠️ Important: If errors persist after failover, the issue may not be regional. Check for global configuration changes (IAM policies, security groups, credential rotation) before escalating.
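One way to tell a regional fault from a global credential or policy problem is to count 403 responses per region in the application logs. The sketch below assumes, hypothetically, that logs are JSON lines carrying `region` and `status` fields; the real log format for this service may differ, so treat the field names as placeholders.

```python
import json
from collections import Counter

# Regions listed in the Service Overview.
REGIONS = ("us-east-1", "us-west-2")


def count_403_by_region(log_lines):
    """Count 403 responses per region from JSON log lines (assumed format)."""
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        if record.get("status") == 403:
            counts[record.get("region", "unknown")] += 1
    return counts


def likely_global_auth_issue(counts, regions=REGIONS):
    """403s in every region hint at a global cause rather than a regional one."""
    return all(counts.get(r, 0) > 0 for r in regions)
```

If `likely_global_auth_issue` returns `True`, that is consistent with the warning above: review IAM policies, security groups, and credential rotation before treating the incident as regional. It is only a hint, since a low background rate of 403s may be normal for this service.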
Mitigation: Regional Failover
Pre-flight Checks