Service Overview

Service: Coffee Grind API

Owner: Platform Team

Tier: Tier 1

Regions: us-east-1 (primary), us-west-2 (failover)

Dashboard: Grafana Dashboard

Symptoms

Diagnosis Steps

  1. Check regional health: Identify which region is reporting errors, using the per-region breakdown on the Grafana dashboard.
  2. Check service dependencies: The Coffee Grind API depends on the Bean Inventory Service. Verify that Bean Inventory is healthy in the affected region.
  3. Check infrastructure: Review ECS task status, ALB target group health, and RDS connection pool metrics.
  4. Check recent deployments: Review all deployments to the affected region in the last 2 hours.
  5. Check auth/credentials: Verify that the service can authenticate to its downstream dependencies (DynamoDB, S3). Look for 403 errors in the application logs.
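Steps 1, 4, and 5 above reduce to simple checks once the data has been exported from the dashboard, the deployment history, and the application logs. The sketch below is illustrative only: it assumes hypothetical input shapes (a per-region mapping of request and error counts, a list of deployment records with UTC timestamps, and raw log lines that contain the HTTP status code as text), and it does not call any AWS or Grafana API. The `Deployment` record and all function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
import re


@dataclass
class Deployment:
    # Hypothetical record shape; adapt to whatever the deploy tool exports.
    region: str
    service: str
    deployed_at: datetime  # timezone-aware, UTC


def error_rate_by_region(counts: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Step 1: map region -> error rate, given (total_requests, error_requests)."""
    return {
        region: (errors / total if total else 0.0)
        for region, (total, errors) in counts.items()
    }


def recent_deployments(deployments, region, now=None, window=timedelta(hours=2)):
    """Step 4: deployments to `region` within the last `window` (default 2 hours)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    return [
        d for d in deployments
        if d.region == region and cutoff <= d.deployed_at <= now
    ]


# Crude heuristic: "403" not directly surrounded by other digits,
# so "status=403" matches but "latency=14030ms" does not.
_FORBIDDEN = re.compile(r"(?<!\d)403(?!\d)")


def count_403s(log_lines) -> int:
    """Step 5: count log lines that mention an HTTP 403 status."""
    return sum(1 for line in log_lines if _FORBIDDEN.search(line))
```

The regex is deliberately loose; if the application emits structured (JSON) logs, filtering on the actual status field is more reliable than pattern matching on raw text.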

⚠️ Important: If errors persist after failover, the issue may not be regional. Check for global configuration changes (IAM policies, security groups, credential rotation) before escalating.
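The warning above amounts to a small decision rule: if every region, including the failover target, is still above the error threshold, the problem is probably not regional. A minimal sketch of that rule follows; the 1% default threshold and the `Scope` names are hypothetical and should be replaced with the service's actual alerting thresholds.

```python
from enum import Enum


class Scope(Enum):
    HEALTHY = "healthy"
    REGIONAL = "regional"
    LIKELY_GLOBAL = "likely_global"


def classify_scope(error_rates: dict[str, float], threshold: float = 0.01) -> Scope:
    """Classify an incident from per-region error rates (0.0 to 1.0).

    threshold is an assumed placeholder (1%), not a documented SLO.
    """
    unhealthy = [region for region, rate in error_rates.items() if rate > threshold]
    if not unhealthy:
        return Scope.HEALTHY
    if len(unhealthy) == len(error_rates):
        # Every region is failing, so failover alone will not help:
        # check IAM policies, security groups, and credential rotation.
        return Scope.LIKELY_GLOBAL
    return Scope.REGIONAL
```

For example, elevated errors in us-east-1 only classify as `REGIONAL` (failover is appropriate), while elevated errors in both us-east-1 and us-west-2 classify as `LIKELY_GLOBAL`.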

Mitigation: Regional Failover

Pre-flight Checks