devops-interview-handbook

Architecture Exercise: Design Multi-Region Disaster Recovery

Problem Statement

Design a disaster recovery solution for a critical financial application that processes $1 billion in transactions annually. The system must survive regional disasters, maintain data consistency, and meet strict RTO/RPO requirements.

Requirements

Functional Requirements

  1. Application Availability
    • Web application (customer-facing)
    • API services
    • Background job processing
    • Real-time transaction processing
  2. Data Requirements
    • Customer accounts
    • Transaction history
    • Financial records
    • Audit logs
  3. Compliance
    • Financial regulations
    • Data retention (7 years)
    • Audit trails
    • Data sovereignty

Non-Functional Requirements

  1. RPO (Recovery Point Objective): < 1 minute (near-zero data loss)
  2. RTO (Recovery Time Objective): < 15 minutes
  3. Availability: 99.99% (52.56 minutes downtime/year)
  4. Data Consistency: Strong consistency required
  5. Geographic Distribution: Primary US, Secondary EU
  6. Compliance: PCI DSS, SOX, GDPR
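The availability target above translates directly into a downtime budget; a quick sanity check of the 52.56-minute figure:

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

print(round(downtime_budget_minutes(0.9999), 2))  # 52.56 minutes/year
```

Keeping this number in view is useful during design reviews: a single 15-minute regional failover already consumes roughly a quarter of the annual budget.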

Constraints and Assumptions

Constraints

Assumptions

Reference Architecture

Multi-Region Active-Active Architecture

                    Global Users
                         |
                  [Route 53 DNS]
              (Latency-Based Routing)
                         |
        ┌─────────────────┴─────────────────┐
        |                                   |
   [US Region]                        [EU Region]
   (Primary)                          (Secondary)
        |                                   |
   ┌────┴────┐                        ┌────┴────┐
   |         |                        |         |
[ALB]    [ALB]                    [ALB]    [ALB]
   |         |                        |         |
   └────┬────┘                        └────┬────┘
        |                                   |
   ┌────┴────┐                        ┌────┴────┐
   |         |                        |         |
[App]    [App]                    [App]    [App]
(AZ-1)  (AZ-2)                  (AZ-1)  (AZ-2)
   |         |                        |         |
   └────┬────┘                        └────┬────┘
        |                                   |
   ┌────┴────┐                        ┌────┴────┐
   |         |                        |         |
[RDS]    [RDS]                    [RDS]    [RDS]
Primary  Replica                Replica  Replica
        |                                   |
        └───────────┬───────────────────────┘
                    |
            [Global Database]
        (Cross-Region Replication)
                    |
        ┌────────────┴────────────┐
        |                         |
   [S3 US]                   [S3 EU]
   (Backup)                 (Backup)
        |                         |
        └────────────┬────────────┘
                     |
            [S3 Cross-Region Replication]

Component Breakdown

1. DNS and Traffic Routing

Component: Route 53 (AWS)

Configuration:

High Availability:
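A sketch of how the latency-based records with health checks might be expressed. The domain, ALB hostnames, and health-check IDs are illustrative placeholders; the resulting `change_batch` has the shape Route 53's `ChangeResourceRecordSets` API expects.

```python
def latency_record(name: str, region: str, target: str,
                   health_check_id: str) -> dict:
    """Build one latency-based Route 53 record change (illustrative values).

    With a health check attached, Route 53 stops answering with an
    unhealthy region, which is what makes active-active failover work.
    """
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "TTL": 60,  # low TTL so clients re-resolve quickly on failover
            "SetIdentifier": f"{region}-endpoint",
            "Region": region,
            "HealthCheckId": health_check_id,
            "ResourceRecords": [{"Value": target}],
        },
    }

change_batch = {
    "Changes": [
        latency_record("app.example.com", "us-east-1",
                       "us-alb.example.com", "hc-us-1234"),
        latency_record("app.example.com", "eu-west-1",
                       "eu-alb.example.com", "hc-eu-5678"),
    ]
}
# change_batch would be passed to route53.change_resource_record_sets(...)
```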

2. Application Layer

Component: ECS/EKS with Auto Scaling

US Region (Primary):

EU Region (Secondary):

Configuration:
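One way to express the scaling posture of each region: register each ECS service as an Application Auto Scaling target, keeping the secondary warm at lower capacity so it can scale to full size during failover. Cluster and service names, and the task counts, are illustrative assumptions.

```python
def ecs_scaling_target(cluster: str, service: str,
                       min_tasks: int, max_tasks: int) -> dict:
    """Parameters for Application Auto Scaling's RegisterScalableTarget
    on an ECS service. All names and counts are placeholders."""
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }

# Primary serves most traffic; secondary runs warm and scales up on failover.
us_target = ecs_scaling_target("prod-us", "api", min_tasks=6, max_tasks=30)
eu_target = ecs_scaling_target("prod-eu", "api", min_tasks=2, max_tasks=30)
```

Giving both regions the same `MaxCapacity` is deliberate: the surviving region must be able to absorb 100% of traffic within the RTO.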

3. Database Layer

Component: RDS PostgreSQL with Cross-Region Replication

US Region (Primary):

EU Region (Replica):

Global Database (Aurora Global Database):

Configuration:

Note: Aurora Global Database manages cross-region replication itself (typically with sub-second lag), so no SQL setup is required. For self-managed RDS PostgreSQL, an equivalent can be built with native logical replication:

-- US Primary: publish all tables for logical replication
CREATE PUBLICATION global_publication FOR ALL TABLES;

-- EU Replica: subscribe to the US publication
-- (the connection string also needs credentials for a replication user)
CREATE SUBSCRIPTION global_subscription
  CONNECTION 'host=us-db.example.com dbname=mydb'
  PUBLICATION global_publication;

RPO Achievement:
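Whether the RPO is actually being met can be checked continuously by comparing replication lag against the one-minute budget from the requirements. A minimal sketch; in practice the lag value would come from `pg_stat_replication` on the primary or from the CloudWatch `ReplicaLag` / `AuroraGlobalDBReplicationLag` metrics.

```python
RPO_BUDGET_SECONDS = 60  # from the "< 1 minute" RPO requirement

def rpo_violated(replication_lag_seconds: float) -> bool:
    """True if cross-region replication lag exceeds the RPO budget,
    meaning a failover right now would lose more data than allowed."""
    return replication_lag_seconds > RPO_BUDGET_SECONDS

assert not rpo_violated(0.8)  # typical sub-second lag: within budget
assert rpo_violated(75.0)     # breach: alert the on-call team
```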

4. Caching Layer

Component: ElastiCache Redis with Global Datastore

Configuration:

Alternative: Redis with Replication

5. Message Queue

Component: Amazon SQS / RabbitMQ with Federation

Configuration:

Alternative: Kafka with MirrorMaker

6. Object Storage

Component: S3 with Cross-Region Replication

Configuration:

Backup Strategy:
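A sketch of the replication rule for the US backup bucket, in the shape S3's `PutBucketReplication` API expects; the IAM role ARN and destination bucket ARN are placeholders.

```python
def s3_replication_config(role_arn: str, dest_bucket_arn: str) -> dict:
    """Cross-region replication rule for the US backup bucket (sketch).

    role_arn and dest_bucket_arn are placeholders for the real IAM
    replication role and the EU destination bucket.
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "backup-to-eu",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter: replicate every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": dest_bucket_arn,
                    "StorageClass": "STANDARD_IA",  # cheaper tier for backups
                },
            }
        ],
    }

config = s3_replication_config(
    "arn:aws:iam::123456789012:role/s3-crr",   # placeholder role
    "arn:aws:s3:::backups-eu",                 # placeholder bucket
)
```

Leaving delete-marker replication disabled is a deliberate DR choice here: an accidental (or malicious) mass delete in the US bucket does not propagate to the EU copy.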

7. Monitoring and Alerting

Component: CloudWatch, X-Ray, Custom Dashboards

Monitoring:

Alerts:

8. Disaster Recovery Automation

Component: Lambda Functions, EventBridge

Failover Automation:

def handle_regional_failure(event, context):
    """EventBridge-triggered failover orchestrator (pseudocode).

    The helper functions below stand in for the real health-check,
    RDS, Route 53, Auto Scaling, and notification calls."""
    # 1. Detect failure via health checks; do nothing if the region is healthy
    if not us_region_health_check_failed():
        return

    # 2. Promote the EU replica to primary (stops replication from US)
    promote_eu_replica_to_primary()

    # 3. Update Route 53 so traffic fails over to EU
    route53_failover_to_eu()

    # 4. Scale the EU region up to full capacity
    scale_up_eu_region()

    # 5. Notify the on-call team (e.g. via SNS)
    send_alert("Failover to EU region initiated")

    # 6. Verify the EU region is healthy and serving traffic
    verify_eu_region_health()

Disaster Recovery Scenarios

Scenario 1: Regional Outage (US Region)

Detection:

Recovery Steps:

  1. Detect Failure (< 1 minute)
    • Health checks fail
    • Alert triggered
  2. Promote EU Database (< 2 minutes)
    • Stop replication from US
    • Promote EU replica to primary
    • Verify data consistency
  3. Update DNS (< 1 minute)
    • Route 53 failover to EU
    • DNS propagation
  4. Scale EU Region (< 5 minutes)
    • Scale application to full capacity
    • Verify health
  5. Verify Service (< 2 minutes)
    • Smoke tests
    • Monitor metrics

Total RTO: < 15 minutes
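The per-step budgets above can be sanity-checked against the 15-minute RTO; summing the worst cases shows how much slack remains:

```python
# Worst-case per-step budgets (minutes) from the recovery steps above.
RECOVERY_STEPS = {
    "detect failure": 1,
    "promote EU database": 2,
    "update DNS": 1,
    "scale EU region": 5,
    "verify service": 2,
}

total = sum(RECOVERY_STEPS.values())
print(f"worst-case RTO: {total} min")  # 11 min, leaving 4 min of slack
assert total <= 15  # must fit inside the 15-minute RTO
```

The 4-minute margin matters: DNS caching beyond the record TTL or a slow scale-up can easily consume it, which is one reason DR drills measure real step timings rather than trusting the budget.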

Scenario 2: Database Failure (US Primary)

Detection:

Recovery Steps:

  1. Failover to US Replica (< 1 minute)
    • RDS Multi-AZ automatic failover
    • Application reconnects automatically
  2. If US Replica Fails (< 5 minutes)
    • Promote EU replica to primary
    • Update application connection strings
    • Update DNS if needed

Total RTO: < 5 minutes (US replica) or < 15 minutes (EU)

Scenario 3: Network Partition

Detection:

Recovery Steps:

  1. Continue Operating (both regions)
    • US handles US traffic
    • EU handles EU traffic
    • Because financial data requires strong consistency, the isolated region may need to queue writes or degrade to read-only until the partition heals (the CAP trade-off)
    • Resolve conflicts once the network is restored
  2. Conflict Resolution:
    • Prefer the most recent timestamp (last-writer-wins)
    • Manual review for conflicting financial records
    • Merge strategies for non-critical data
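The timestamp-based strategy above is essentially last-writer-wins; a minimal sketch, with ties escalated because financial records should never be merged silently (record fields are illustrative):

```python
from datetime import datetime, timezone

def resolve_conflict(us_record: dict, eu_record: dict) -> dict:
    """Last-writer-wins on updated_at; exact ties go to manual review."""
    if us_record["updated_at"] == eu_record["updated_at"]:
        raise ValueError("concurrent update: escalate to manual review")
    return max(us_record, eu_record, key=lambda r: r["updated_at"])

a = {"id": 1, "balance": 100,
     "updated_at": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)}
b = {"id": 1, "balance": 90,
     "updated_at": datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)}

assert resolve_conflict(a, b)["balance"] == 90  # later write wins
```

Note that wall-clock timestamps are only trustworthy if clocks are tightly synchronized across regions; with meaningful skew, version vectors or a single write region per account are safer choices.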

Discussion Points

Data Consistency Trade-offs

Strong Consistency (Chosen):

Eventual Consistency:

Decision: Strong consistency for financial data (regulatory requirement)

Active-Active vs Active-Passive

Active-Active (Chosen):

Active-Passive:

Decision: Active-Active, for lower user-facing latency and because capacity in both regions serves traffic instead of sitting idle

Replication Strategies

Synchronous Replication:

Asynchronous Replication:

Decision: Hybrid approach (sync within region, async cross-region)

Cost Considerations

Estimated Monthly Cost:

Optimization:

Testing Strategy

Regular DR Drills:

Test Scenarios:

  1. Database failover
  2. Regional failover
  3. Network partition
  4. Data corruption recovery

Success Criteria:

Implementation Phases

Phase 1: Single Region Multi-AZ

Phase 2: Cross-Region Replication

Phase 3: Active-Active

Phase 4: Automation

Key Takeaways

  1. RPO/RTO: Define clear objectives
  2. Replication: Choose appropriate strategy
  3. Automation: Automate failover when possible
  4. Testing: Regular DR drills
  5. Monitoring: Comprehensive observability
  6. Documentation: Detailed runbooks
  7. Cost: Balance cost vs requirements

Follow-up Questions

  1. How do you handle data conflicts in active-active?
  2. How do you test DR without impacting production?
  3. How do you ensure compliance during failover?
  4. How do you handle partial failures?
  5. How do you optimize costs while maintaining DR?