Your production application is experiencing a critical outage. Users are reporting 500 errors, the application is slow, and monitoring shows high error rates. You need to respond quickly, diagnose the issue, and restore service.
Step 1: Acknowledge Incident
# Create incident channel/thread
# Notify team
# Set severity level (P1 - Critical)
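These first steps can be scripted so they take seconds, not minutes. A minimal sketch, assuming a Slack incoming-webhook URL and a PagerDuty Events API v2 routing key (both placeholders):
# Post to the incident channel (Slack incoming webhook)
curl -X POST -H 'Content-Type: application/json' \
  -d '{"text":"P1 INCIDENT: elevated 500s on production API"}' \
  https://hooks.slack.com/services/...
# Page the on-call engineer (PagerDuty Events API v2)
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{"routing_key":"<routing-key>","event_action":"trigger","payload":{"summary":"P1: 500 errors on production","source":"monitoring","severity":"critical"}}'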
Step 2: Gather Initial Information
# Check monitoring dashboards
# Review recent alerts
# Check deployment history
# Review recent changes
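A couple of commands that answer most of these quickly (resource names are illustrative):
# Which alarms are currently firing?
aws cloudwatch describe-alarms --state-value ALARM
# What was deployed recently, and when?
kubectl rollout history deployment/web-app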
Key Questions:
- When did the errors start, and was the onset sudden or gradual?
- What changed recently (deployments, configuration, infrastructure, traffic)?
- Is the impact total or partial (specific endpoints, regions, customer segments)?
- Are dependencies (database, cache, third-party APIs) healthy?
Step 3: Initial Assessment
# Check application logs
aws logs tail /aws/ec2/application --follow
# Check error rates
# Check database status
# Check infrastructure health
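A quick first sweep might look like this (names are illustrative; deeper checks follow in the diagnosis steps):
# Anything crash-looping, pending, or restarting?
kubectl get pods -l app=web-app
# Is the database instance itself healthy?
aws rds describe-db-instances \
  --db-instance-identifier prod-db \
  --query 'DBInstances[0].DBInstanceStatus'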
Step 1: Stop the Bleeding
Option A: Scale Up (If Resource Constraint)
# Increase Auto Scaling Group capacity
aws autoscaling set-desired-capacity \
--auto-scaling-group-name app-asg \
--desired-capacity 20 \
--honor-cooldown
# Or manually add instances
Option B: Enable Maintenance Mode
# Show maintenance page to users
# Route traffic away from affected services
# Use feature flags to disable features
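If the application is fronted by an ALB and cannot serve anything useful, one option is to switch the listener's default action to a fixed 503 while you work; a sketch (listener ARN is a placeholder):
aws elbv2 modify-listener \
  --listener-arn arn:aws:elasticloadbalancing:... \
  --default-actions '[{"Type":"fixed-response","FixedResponseConfig":{"StatusCode":"503","ContentType":"text/plain","MessageBody":"Maintenance in progress"}}]'
Remember to restore the original forward action once the fix is verified.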
Option C: Rollback Recent Deployment
# If recent deployment, rollback
kubectl rollout undo deployment/web-app
# or
aws ecs update-service \
  --service web-app \
  --task-definition previous-version \
  --force-new-deployment
Step 2: Isolate Affected Components
# Remove unhealthy instances from load balancer
aws elbv2 deregister-targets \
--target-group-arn arn:aws:elasticloadbalancing:... \
--targets Id=i-1234567890abcdef0
# Or in Kubernetes
kubectl delete pod <unhealthy-pod>
Step 1: Check Application Logs
# CloudWatch Logs Insights query
# (set the time range, e.g. the last 15 minutes, when starting the query)
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by @message
# Or using kubectl
kubectl logs -l app=web-app --tail=1000 | grep ERROR
What to Look For:
- Error messages and stack traces that repeat across instances
- Timeouts or connection errors to the database, cache, or downstream services
- Whether the first errors line up with a deployment or configuration change
- Out-of-memory messages or unexpected restarts
Step 2: Check Database
# Check database connections
aws rds describe-db-instances --db-instance-identifier prod-db
# Check slow queries
# Connect to database and run:
SHOW FULL PROCESSLIST;  -- FULL shows the complete query text instead of the first 100 characters
# Check for locks (MySQL 5.7 and earlier; on MySQL 8.0 use performance_schema.data_locks)
SELECT * FROM information_schema.innodb_locks;
# Check connection count
SHOW STATUS LIKE 'Threads_connected';
Step 3: Check Infrastructure
# Check EC2 instances
aws ec2 describe-instance-status --instance-ids i-1234567890
# Check CPU, memory, disk
# Use CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-01T01:00:00Z \
--period 300 \
--statistics Average
# Check load balancer
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:...
Step 4: Check Recent Changes
# Git history
git log --oneline -10
# Deployment history
kubectl rollout history deployment/web-app
# CloudTrail (recent API calls)
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
--max-results 10
Common Root Causes:
1. Database Connection Pool Exhausted
Symptoms:
- "Too many connections" errors or timeouts acquiring a connection from the pool
- Threads_connected climbing toward max_connections
- Requests queueing and 500s even though application CPU looks normal
Investigation:
-- Check active connections
SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';
-- Check for long-running queries
SELECT * FROM information_schema.processlist
WHERE time > 30
ORDER BY time DESC;
-- Check for locks
SHOW ENGINE INNODB STATUS;
Resolution:
- Kill long-running or idle connections to free the pool (see the bulk-kill sketch below)
- Temporarily raise the application pool size and/or max_connections
- Fix the underlying connection leak (see the example fix later in this section)
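To clear idle connections in bulk rather than issuing KILL one at a time, a common pattern (the 300-second threshold is illustrative) is to generate the KILL statements from the processlist and pipe them back in:
mysql -N -B -e "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist WHERE command = 'Sleep' AND time > 300;" | mysql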
2. Memory Leak / Out of Memory
Symptoms:
- Memory usage climbs steadily until the process is OOM-killed or restarts
- Increasing GC pauses and latency before each crash
- Errors arrive in waves that line up with process restarts
Investigation:
# Check memory usage
free -h
top
docker stats
# Check for memory leaks
# Application profiling
# Heap dumps
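If the service runs on the JVM, a heap dump is the most direct way to see what is holding memory; a sketch (PID, pod name, and output path are placeholders):
# Dump the heap of a local JVM process
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
# Same thing for a containerized JVM (assumes jmap is available in the image)
kubectl exec <pod-name> -- jmap -dump:live,format=b,file=/tmp/heap.hprof 1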
Resolution:
- Restart or recycle the affected processes to restore service
- Temporarily raise memory limits or add capacity
- Capture a heap dump and fix the leak before memory pressure builds again
3. Slow Database Queries
Symptoms:
- Response times rise on endpoints that touch the database
- Growing processlist, lock waits, and request timeouts
- High CPU or I/O on the database rather than the application tier
Investigation:
-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;
-- Check slow queries (mysql.slow_log is only populated when log_output='TABLE';
-- otherwise read the slow query log file)
SELECT * FROM mysql.slow_log ORDER BY start_time DESC LIMIT 10;
-- Check indexes
EXPLAIN SELECT * FROM orders WHERE user_id = 123;
Resolution:
- Add the missing index or rewrite the offending query (sketch below)
- Kill currently running pathological queries if they are blocking others
- Add caching or read replicas for hot read paths
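If EXPLAIN shows a full table scan on the filtered column, adding the missing index is often the fix; a sketch following the example above (run it in a low-traffic window or as online DDL on large tables):
mysql -e "CREATE INDEX idx_orders_user_id ON orders (user_id);"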
4. Recent Code Deployment
Symptoms:
- Errors begin shortly after a deployment finishes
- New error signatures that match the changed code paths
- Only the new version (or the canary) is affected
Investigation:
# Check deployment time vs incident time
# Review code changes
git diff <previous-commit> <current-commit>
# Check if canary deployment affected
Resolution:
- Roll back to the previous known-good version first
- Reproduce and fix the bug, then redeploy behind a canary
Example: Database Connection Pool Issue
Immediate Fix:
# 1. Increase connection pool (application config)
# Update application configuration
# Restart application
# 2. Kill idle connections
mysql -e "KILL <connection_id>;"
# 3. Increase database max_connections
#    (max_connections lives in the DB parameter group; attach a group where it has been raised)
aws rds modify-db-instance \
  --db-instance-identifier prod-db \
  --db-parameter-group-name new-params \
  --apply-immediately
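A sketch of raising the limit in the parameter group first (parameter group name and value are illustrative):
aws rds modify-db-parameter-group \
  --db-parameter-group-name new-params \
  --parameters "ParameterName=max_connections,ParameterValue=500,ApplyMethod=immediate"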
Permanent Fix:
// Fix connection leak in code
// Ensure connections are closed
try (Connection conn = dataSource.getConnection()) {
// use connection
} // automatically closed
// Or use connection pool properly
// Set appropriate pool size
// Monitor connection usage
Verification:
# Monitor error rate
# Should decrease
# Check database connections
mysql -e "SHOW STATUS LIKE 'Threads_connected';"
# Check application health
curl https://api.example.com/health
# Monitor metrics
# Error rate should return to normal
Step 1: Verify Resolution
# Check error rates (should be <1%)
# Check response times (should be <500ms)
# Check user reports (should decrease)
# Test critical user flows
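One concrete check is the load balancer's 5xx count over the last few minutes; a sketch assuming an ALB (the dimension value is a placeholder, and the date commands assume GNU date):
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/web-alb/1234567890abcdef \
  --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Sum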
Step 2: Gradual Traffic Restoration
# If traffic was diverted, gradually restore
# Monitor closely
# Keep maintenance mode option ready
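If traffic was diverted to a standby target group, an ALB listener with weighted target groups lets you shift it back in steps; a sketch (ARNs and weights are placeholders):
aws elbv2 modify-listener \
  --listener-arn arn:aws:elasticloadbalancing:... \
  --default-actions '[{"Type":"forward","ForwardConfig":{"TargetGroups":[{"TargetGroupArn":"arn:aws:elasticloadbalancing:...","Weight":90},{"TargetGroupArn":"arn:aws:elasticloadbalancing:...","Weight":10}]}}]'
Increase the recovering group's weight gradually while watching the error rate.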
Step 3: Enhanced Monitoring
# Set up additional alerts
# Monitor key metrics closely
# Watch for recurrence
Step 1: Document Incident
Incident Report Template:
# Incident Report: [Title]
## Summary
- **Date/Time**: [Start] - [End]
- **Duration**: [X hours Y minutes]
- **Severity**: P1
- **Impact**: [Users affected, revenue loss, etc.]
## Timeline
- [Time] - Incident detected
- [Time] - Team notified
- [Time] - Containment started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored
## Root Cause
[Detailed explanation]
## Resolution
[What was done to fix]
## Impact
- Users affected: [X]
- Revenue impact: [$X]
- Downtime: [X minutes]
## Lessons Learned
- What went well
- What could be improved
- Action items
## Action Items
- [ ] Fix connection leak
- [ ] Add monitoring
- [ ] Update runbooks
- [ ] Improve alerting
Step 2: Action Items
# Assign an owner and due date to each action item
# Track them to completion rather than letting them stall in the report
Step 3: Follow-up
# Set up alerts for:
- Error rate > 1%
- Response time > 1s
- Database connections > 80%
- CPU > 80%
- Memory > 80%
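Each of these can be codified as a CloudWatch alarm; a minimal sketch for the CPU alert (ASG name and SNS topic ARN are placeholders):
aws cloudwatch put-metric-alarm \
  --alarm-name prod-asg-cpu-high \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=AutoScalingGroupName,Value=app-asg \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:...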