devops-interview-handbook

AWS Interview Questions


EC2 & Compute

Q1: What is the difference between EC2 instance types and when would you choose one over another?

Difficulty: Mid

Answer:

EC2 instance types are optimized for different use cases:

Real-world Context: Choose based on workload characteristics. A web application might use t3.medium, while a Redis cache cluster would use r5.large.

Follow-up: How do you determine the right instance size? (Use CloudWatch metrics, load testing, start small and scale)


Q2: Explain the difference between EBS volume types and their use cases.

Difficulty: Mid

Answer:

EBS volume types differ in performance characteristics:

Real-world Context: Production database → io2. Web server boot volume → gp3. Log aggregation → st1.

Follow-up: How do you migrate from gp2 to gp3? (Modify the volume type in place with Elastic Volumes, which needs no downtime; or create a snapshot and restore it to a new gp3 volume)


Q3: What is an Auto Scaling Group and how does it work?

Difficulty: Mid

Answer:

An Auto Scaling Group (ASG) automatically adjusts the number of EC2 instances based on demand or schedules.

Components:

How it works:

  1. CloudWatch alarms trigger scaling policies
  2. ASG launches/terminates instances based on metrics (CPU, custom metrics; memory requires the CloudWatch agent because EC2 does not publish it by default)
  3. New instances are registered with load balancer
  4. Unhealthy instances are replaced automatically

Real-world Context: Web application behind ALB. Scale from 2 to 10 instances during peak hours based on CPU utilization.

Follow-up: What’s the difference between target tracking and step scaling? (Target tracking maintains a metric at target value, step scaling uses multiple thresholds)
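
The two policy types in the follow-up can be compared with a small sketch. AWS does not publish the exact target tracking algorithm, so the proportional formula here is an approximation, and the step thresholds, function names, and the 2 to 10 size limits (taken from the scenario above) are illustrative assumptions.

```python
import math

def target_tracking_capacity(current_capacity, metric_value, target_value):
    """Proportional estimate of the capacity that returns the metric to its target."""
    return max(1, math.ceil(current_capacity * metric_value / target_value))

# Hypothetical step adjustments: (lower bound of CPU %, instances to add),
# ordered from the highest threshold to the lowest.
STEP_ADJUSTMENTS = [(90, 3), (75, 2), (60, 1)]

def step_scaling_capacity(current_capacity, metric_value, steps=STEP_ADJUSTMENTS):
    """Apply the adjustment of the highest threshold the metric has breached."""
    for lower_bound, instances_to_add in steps:
        if metric_value >= lower_bound:
            return current_capacity + instances_to_add
    return current_capacity

def clamp(capacity, min_size=2, max_size=10):
    """Keep the desired capacity inside the group's min/max size."""
    return max(min_size, min(max_size, capacity))

# 4 instances at 75% CPU with a 50% target.
print(clamp(target_tracking_capacity(4, 75, 50)))
print(clamp(step_scaling_capacity(4, 80)))
```

Target tracking needs only one number (the target), while step scaling makes you choose each threshold and adjustment yourself.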


Q4: Explain VPC, Subnets, Internet Gateway, and NAT Gateway.

Difficulty: Mid

Answer:

VPC (Virtual Private Cloud): Isolated network environment in AWS. You control IP ranges, subnets, routing, security.

Subnets: Logical subdivision of VPC IP range. Can be public (route to IGW) or private (route to NAT).

Internet Gateway (IGW): Allows communication between VPC and internet. One per VPC. Provides 1:1 NAT for public IPs.

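NAT Gateway: Allows instances in private subnets to reach the internet for updates/downloads while remaining unreachable from the internet. Managed service, redundant within a single AZ; for multi-AZ resilience, deploy one NAT Gateway per AZ and route each private subnet to the gateway in its own AZ.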

Architecture:

Real-world Context: Web servers in public subnets, databases in private subnets. NAT Gateway allows DB instances to download patches.

Follow-up: What’s the difference between NAT Gateway and NAT Instance? (NAT Gateway is managed, scales automatically, more expensive. NAT Instance is EC2-based, requires management, cheaper for low traffic)
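
A rough sketch of the public/private layout described above, using only Python's standard `ipaddress` module. The 10.0.0.0/16 range, the subnet names, and the `igw`/`nat-a`/`nat-b` targets are made-up placeholders, and the implicit local route that every route table carries is omitted.

```python
import ipaddress
from itertools import islice

def plan_subnets(vpc_cidr, names, new_prefix=24):
    """Carve one subnet block per name out of the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = islice(vpc.subnets(new_prefix=new_prefix), len(names))
    return dict(zip(names, blocks))

subnets = plan_subnets(
    "10.0.0.0/16", ["public-a", "public-b", "private-a", "private-b"]
)

# Default route per subnet: public subnets point at the Internet Gateway,
# private subnets point at the NAT Gateway in their own AZ.
ROUTES = {
    "public-a": {"0.0.0.0/0": "igw"},
    "public-b": {"0.0.0.0/0": "igw"},
    "private-a": {"0.0.0.0/0": "nat-a"},
    "private-b": {"0.0.0.0/0": "nat-b"},
}

for name, block in subnets.items():
    print(f"{name:10} {block}  default route -> {ROUTES[name]['0.0.0.0/0']}")
```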


Q5: What is the difference between Security Groups and NACLs?

Difficulty: Mid

Answer:

Security Groups:

NACLs (Network ACLs):

Use Cases:

Real-world Context: Security Groups protect instances. NACLs add extra layer - block known malicious IPs at subnet level.

Follow-up: If you block port 80 in NACL but allow it in Security Group, what happens? (Inbound traffic is blocked: the NACL is evaluated at the subnet boundary before the Security Group, and traffic must pass both)
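
The follow-up scenario can be modeled in a few lines. This is a deliberately simplified sketch: rules match on a single port only, return traffic and statefulness are ignored, and the class and function names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class NaclRule:
    number: int   # lower numbers are evaluated first
    port: int     # single port, for simplicity
    action: str   # "allow" or "deny"

def nacl_allows(rules, port):
    """NACL: first matching rule in number order wins; no match means deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port:
            return rule.action == "allow"
    return False

def security_group_allows(allowed_ports, port):
    """Security Group: allow rules only; anything not listed is denied."""
    return port in allowed_ports

def inbound_allowed(nacl_rules, sg_ports, port):
    """Inbound traffic must pass the subnet's NACL and then the instance's SG."""
    return nacl_allows(nacl_rules, port) and security_group_allows(sg_ports, port)

nacl = [NaclRule(100, 80, "deny"), NaclRule(200, 80, "allow"), NaclRule(300, 443, "allow")]
sg_ports = {80, 443}

print(inbound_allowed(nacl, sg_ports, 80))   # rule 100 denies before rule 200 is reached
print(inbound_allowed(nacl, sg_ports, 443))
```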


S3 & Storage

Q6: Explain S3 storage classes and when to use each.

Difficulty: Mid

Answer:

S3 Standard: 99.99% availability, 99.999999999% durability. For frequently accessed data. Low latency, high throughput.

S3 Intelligent-Tiering: Automatically moves objects between access tiers. For unpredictable access patterns. Small monitoring fee.

S3 Standard-IA (Infrequent Access): Lower cost for infrequently accessed data. 99.9% availability. Minimum 30-day storage, retrieval fee.

S3 One Zone-IA: Like Standard-IA but stored in a single AZ. 99.5% availability. About 20% cheaper than Standard-IA. For non-critical, reproducible data.

S3 Glacier Instant Retrieval: Archive with millisecond retrieval. For rarely accessed data requiring immediate access. Minimum 90-day storage.

S3 Glacier Flexible Retrieval: 3 retrieval options (expedited 1-5 min, standard 3-5 hours, bulk 5-12 hours). For archives, backups.

S3 Glacier Deep Archive: Lowest cost. Standard retrieval within 12 hours (bulk within 48 hours). Minimum 180-day storage. For long-term compliance archives.

Real-world Context: Active application data → Standard. Logs older than 30 days → Standard-IA. Compliance archives → Glacier Deep Archive.

Follow-up: How does lifecycle policy work? (Rules automatically transition or expire objects based on age, optionally filtered by prefix or tags)
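
As an illustration, a lifecycle rule for the log scenario above might look like the following. The dictionary follows the shape boto3 accepts as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`; the rule ID, the `logs/` prefix, and the 365-day threshold are assumptions, and `storage_class_at` is a hypothetical helper that only mimics the outcome locally.

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-logs",             # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},    # assumed key prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}

def storage_class_at(age_days, rule):
    """Return the class an object of the given age would be in under the rule."""
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current

print(json.dumps(lifecycle_configuration, indent=2))
print(storage_class_at(45, lifecycle_configuration["Rules"][0]))
```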


Q7: What is S3 versioning and why is it important?

Difficulty: Junior

Answer:

S3 versioning keeps multiple versions of an object with the same key. Each version has a unique version ID.

Benefits:

How it works:

Real-world Context: Developer accidentally overwrites production config file. With versioning, restore previous version immediately.

Follow-up: How do you enable versioning on an existing bucket? (Enable versioning on the bucket; objects that already exist keep a null version ID, and new writes get unique version IDs)
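
To make the overwrite-and-restore scenario concrete, here is a toy in-memory model of a versioned bucket. It is not the S3 API: real version IDs are opaque strings, and the `VersionedBucket` class and its methods are invented for this sketch.

```python
import itertools

class VersionedBucket:
    """Toy model: every write appends a version; a delete appends a marker."""

    def __init__(self):
        self._versions = {}              # key -> list of (version_id, body or None)
        self._ids = itertools.count(1)

    def put(self, key, body):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key):
        # A delete without a version ID only adds a delete marker (body None).
        return self.put(key, None)

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is None:
            # Latest version; a delete marker on top makes the key look deleted.
            if not versions or versions[-1][1] is None:
                raise KeyError(key)
            return versions[-1][1]
        for vid, body in versions:
            if vid == version_id and body is not None:
                return body
        raise KeyError((key, version_id))

bucket = VersionedBucket()
good = bucket.put("config.yml", "timeout: 30")
bucket.put("config.yml", "timeout: 0")          # accidental overwrite
print(bucket.get("config.yml", version_id=good))
```

The older version stays retrievable by ID after the overwrite, which is what makes the restore in the scenario possible.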


Q8: Explain S3 cross-region replication and its use cases.

Difficulty: Mid

Answer:

S3 Cross-Region Replication (CRR) automatically replicates objects to another region.

Requirements:

Use Cases:

How it works:

Real-world Context: A compliance or disaster-recovery policy requires a copy of the data in a second region. Replicate from us-east-1 to eu-west-1.

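Follow-up: What happens if you delete an object? (With the current replication configuration, delete markers are replicated only if delete marker replication is enabled; deleting a specific version ID is never replicated, which protects the replica from malicious deletions)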


VPC & Networking

Q9: What is a VPC Peering connection and how does it work?

Difficulty: Mid

Answer:

VPC Peering connects two VPCs using private IP addresses, as if they’re in the same network.

Characteristics:

Setup:

  1. Request peering connection (initiator)
  2. Accept peering connection (accepter)
  3. Update route tables in both VPCs
  4. Update security groups/NACLs to allow traffic

Use Cases:

Real-world Context: Development VPC needs access to shared services VPC (databases, monitoring).

Follow-up: How do you connect 3 VPCs? (Need 3 peering connections - not transitive. Or use Transit Gateway)


Q10: What is AWS Transit Gateway and when would you use it?

Difficulty: Senior

Answer:

Transit Gateway is a network transit hub that connects VPCs, VPNs, and Direct Connect.

Benefits:

Use Cases:

Architecture:

Real-world Context: Organization with 50 VPCs. Instead of 1,225 peering connections, use Transit Gateway with one attachment per VPC.
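
The 1,225 figure comes from the full-mesh formula n(n-1)/2. A short check, with a function name chosen just for this example:

```python
def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC can reach every other VPC directly."""
    return vpc_count * (vpc_count - 1) // 2

for n in (3, 10, 50):
    print(f"{n} VPCs: {full_mesh_peerings(n)} peerings vs {n} Transit Gateway attachments")
```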

Follow-up: How do you implement network segmentation with Transit Gateway? (Use multiple route tables, attach VPCs to different route tables)


IAM & Security

Q11: Explain IAM roles vs IAM users and when to use each.

Difficulty: Mid

Answer:

IAM Users:

IAM Roles:

Key Differences:

Best Practice: Use roles for everything possible. Only use users when absolutely necessary.

Real-world Context: EC2 instance needs S3 access → IAM Role attached to instance. Developer needs console access → IAM User with MFA.

Follow-up: How do you assume a role? (Use AWS STS AssumeRole API, or attach role to service)


Q12: What is the principle of least privilege in IAM?

Difficulty: Junior

Answer:

Principle of least privilege: Grant only the minimum permissions necessary to perform a task.

Implementation:

Example:

Benefits:

Real-world Context: Lambda function only needs to read from one S3 bucket. Grant s3:GetObject on that bucket only, not s3:*.

Follow-up: How do you audit IAM permissions? (Use IAM Access Analyzer, CloudTrail, or IAM policy simulator)
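
A minimal sketch of the policy from the Lambda scenario above, built as a Python dictionary and printed as JSON. The bucket name `example-reports-bucket` and the helper name are placeholders.

```python
import json

def read_only_bucket_policy(bucket_name):
    """Identity policy that allows reading objects from one bucket and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # Object-level actions target the objects, hence the trailing /*.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }

policy = read_only_bucket_policy("example-reports-bucket")
print(json.dumps(policy, indent=2))
```

Compared with `s3:*` on `*`, this statement cannot list, write, or delete anything, and it does not reach other buckets.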


Q13: Explain IAM policy evaluation logic.

Difficulty: Senior

Answer:

IAM evaluates policies in this order:

  1. Default Deny: All requests denied by default
  2. Explicit Deny: Any explicit Deny overrides everything
  3. Explicit Allow: Grants permission
  4. Default Deny: If no Allow, request denied

Evaluation Process:

Key Points:

Real-world Context: User has Allow in user policy, but Deny in group policy → Request denied (explicit Deny wins).

Follow-up: What’s the difference between permissions boundary and policy? (Boundary sets maximum permissions, policy grants permissions within boundary)
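
The core rule (explicit Deny beats Allow, and no match means Deny) can be sketched as below. This is a heavy simplification: it ignores conditions, permissions boundaries, SCPs, and resource-based policies, and it matches with case-sensitive shell-style wildcards, whereas real IAM action names are case-insensitive. The bucket names are made up.

```python
from fnmatch import fnmatchcase

def evaluate(statements, action, resource):
    """Return the decision for one request against a flat list of statements."""
    allowed = False
    for statement in statements:
        matches = fnmatchcase(action, statement["Action"]) and fnmatchcase(
            resource, statement["Resource"]
        )
        if not matches:
            continue
        if statement["Effect"] == "Deny":
            return "Deny (explicit)"      # an explicit Deny ends evaluation
        allowed = True
    return "Allow" if allowed else "Deny (default)"

statements = [
    # From the user policy
    {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"},
    # From the group policy
    {"Effect": "Deny", "Action": "s3:*", "Resource": "arn:aws:s3:::prod-data/*"},
]

print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::prod-data/report.csv"))
print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::dev-data/report.csv"))
print(evaluate(statements, "s3:PutObject", "arn:aws:s3:::dev-data/report.csv"))
```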


Lambda & Serverless

Q14: What are Lambda cold starts and how do you minimize them?

Difficulty: Mid

Answer:

Cold Start: Time to initialize Lambda execution environment (download code, initialize runtime, run initialization code).

Factors Affecting Cold Starts:

Minimization Strategies:

  1. Optimize package size: Remove unused dependencies, use layers for common code
  2. Choose faster runtime: Node.js/Python over Java
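  3. Avoid VPC unless needed: VPC attachment used to add 10+ seconds; since the 2019 move to shared (Hyperplane) network interfaces the penalty is much smaller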
  4. Use Provisioned Concurrency: Pre-warms execution environments
  5. Optimize initialization: Move heavy initialization outside handler
  6. Keep functions warm: A scheduled EventBridge (formerly CloudWatch Events) rule pings the function periodically

Real-world Context: API Gateway → Lambda. Cold start adds 2-3 seconds. Use Provisioned Concurrency for production APIs.

Follow-up: When would you use Provisioned Concurrency? (When you need consistent low latency and can predict traffic)
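
Strategy 5 is the easiest one to show in code. In this sketch, `load_config` stands in for any expensive setup (creating SDK clients, loading a model); the names, the sleep, and the event shape are assumptions for illustration.

```python
import time

INIT_COUNT = 0

def load_config():
    """Stand-in for expensive setup work done at startup."""
    global INIT_COUNT
    INIT_COUNT += 1
    time.sleep(0.05)  # simulate slow initialization
    return {"table": "orders"}

# Module-level code runs once per execution environment, during the cold start.
CONFIG = load_config()

def handler(event, context):
    # Warm invocations reuse CONFIG instead of rebuilding it on every call.
    return {"table": CONFIG["table"], "order_id": event["order_id"]}

# Two invocations in the same environment trigger only one initialization.
handler({"order_id": 1}, None)
handler({"order_id": 2}, None)
print(INIT_COUNT)
```

Had `load_config()` been called inside `handler`, every invocation would pay the setup cost, not just the cold one.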


Q15: Explain Lambda concurrency and reserved concurrency.

Difficulty: Mid

Answer:

Concurrency: Number of execution environments running simultaneously.

Default Limits:

Reserved Concurrency:

Use Cases:

Throttling:

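Real-world Context: Lambda calls external API with 100 req/sec limit. Reserved concurrency caps simultaneous executions, not requests per second, so size it as rate limit × average duration: with 500 ms invocations, a reserved concurrency of 50 keeps the function near 100 req/sec.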

Follow-up: What happens when account concurrency limit is reached? (Functions drawing on the unreserved pool are throttled; functions with reserved concurrency keep their reserved capacity)
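
Concurrency and request rate are linked through average invocation duration (concurrency ≈ requests per second × duration in seconds). A quick calculation helps when sizing reserved concurrency against a downstream rate limit; the function names and sample durations are illustrative.

```python
import math

def required_concurrency(requests_per_second, avg_duration_seconds):
    """Execution environments needed to sustain a given request rate."""
    return math.ceil(requests_per_second * avg_duration_seconds)

def max_request_rate(concurrency, avg_duration_seconds):
    """Upper bound on requests per second for a given concurrency cap."""
    return concurrency / avg_duration_seconds

# 100 req/sec at 500 ms per invocation.
print(required_concurrency(100, 0.5))
# A cap of 100 with 250 ms invocations still permits far more than 100 req/sec.
print(max_request_rate(100, 0.25))
```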


Q16: What is Lambda@Edge and what are its use cases?

Difficulty: Senior

Answer:

Lambda@Edge runs Lambda functions at CloudFront edge locations, closer to users.

Use Cases:

Limitations:

Execution Points:

Real-world Context: Redirect mobile users to mobile site, add security headers, implement geo-blocking.

Follow-up: What’s the difference between Viewer Request and Origin Request? (Viewer Request runs on every request, before the cache check; Origin Request runs only on a cache miss, before CloudFront forwards the request to the origin)
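
For the security-headers use case, a function attached to a response trigger could look roughly like this. The event layout (`Records[0].cf.response`, with lowercase header names mapping to lists of key/value pairs) follows the CloudFront event structure; the specific header values and the sample event are assumptions.

```python
def handler(event, context):
    """Add security headers to the response before CloudFront returns it."""
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    # Header map keys are lowercase; each value is a list of key/value pairs.
    headers["strict-transport-security"] = [
        {"key": "Strict-Transport-Security", "value": "max-age=63072000; includeSubDomains"}
    ]
    headers["x-content-type-options"] = [
        {"key": "X-Content-Type-Options", "value": "nosniff"}
    ]
    return response

# Minimal fake event for a local check.
sample_event = {"Records": [{"cf": {"response": {"status": "200", "headers": {}}}}]}
result = handler(sample_event, None)
print(sorted(result["headers"]))
```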


RDS & Databases

Q17: Explain RDS Multi-AZ and Read Replicas.

Difficulty: Mid

Answer:

Multi-AZ:

Read Replicas:

Key Differences:

Can Combine: Multi-AZ primary with Read Replicas for both HA and read scaling.

Real-world Context: E-commerce site. Multi-AZ for primary DB (HA), Read Replicas for reporting/analytics (read scaling).

Follow-up: What’s the RTO/RPO for Multi-AZ? (RTO: 60-120s, RPO: 0 - no data loss)


Q18: What is RDS Proxy and why would you use it?

Difficulty: Senior

Answer:

RDS Proxy is a fully managed database connection pooler and proxy.

Benefits:

Use Cases:

How it works:

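Real-world Context: Lambda functions connecting to RDS. Without a proxy, every concurrent execution environment opens its own connection, which can exhaust the database's connection limit during bursts. With the proxy, pooled connections are reused.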

Follow-up: What’s the difference between RDS Proxy and connection pooling in application? (RDS Proxy is managed, shared across applications, handles failover)


CloudWatch & Monitoring

Q19: Explain CloudWatch Logs, Metrics, and Alarms.

Difficulty: Mid

Answer:

CloudWatch Logs:

CloudWatch Metrics:

CloudWatch Alarms:

Real-world Context: Monitor CPU utilization → CloudWatch Metric. Alert when > 80% → CloudWatch Alarm → SNS → Email. Auto Scale based on alarm → ASG.

Follow-up: What’s the difference between CloudWatch and CloudTrail? (CloudWatch: metrics/logs, CloudTrail: API calls/audit)
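
The metric-to-alarm step in the scenario above can be approximated with a small function. This is a simplified model, assuming an alarm that fires when the last N datapoints all exceed the threshold; it ignores "M out of N" settings and missing-data treatment, and the defaults are illustrative.

```python
def alarm_state(datapoints, threshold=80.0, evaluation_periods=3):
    """Classify the alarm from the most recent datapoints (oldest first)."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"

print(alarm_state([50, 85, 90, 95]))   # last three periods all above 80
print(alarm_state([85, 90, 70]))       # latest period recovered
print(alarm_state([90]))               # not enough periods yet
```

Requiring several consecutive breaches keeps a single CPU spike from paging anyone or triggering a scale-out.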


Q20: What is CloudWatch Logs Insights and how do you use it?

Difficulty: Mid

Answer:

CloudWatch Logs Insights is an interactive query feature, with its own query language, for searching and analyzing log data.

Query Syntax:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)

Common Use Cases:

Fields:

Real-world Context: Application logs contain JSON. Query: fields @timestamp, level, error | filter level = "ERROR" | stats count() by error.

Follow-up: How do you create a dashboard from Logs Insights query? (Save query, add to dashboard, or export to CloudWatch Metrics)
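
To see what the JSON query in the real-world context computes, here is a local Python equivalent of `filter level = "ERROR" | stats count() by error`. It is only an illustration of the logic, not how Logs Insights is invoked, and the sample log lines are invented.

```python
import json
from collections import Counter

def count_errors(log_lines):
    """Count ERROR-level records grouped by their 'error' field."""
    counts = Counter()
    for line in log_lines:
        record = json.loads(line)
        if record.get("level") == "ERROR":
            counts[record.get("error")] += 1
    return counts

sample_logs = [
    '{"level": "INFO", "message": "started"}',
    '{"level": "ERROR", "error": "TimeoutError"}',
    '{"level": "ERROR", "error": "TimeoutError"}',
    '{"level": "ERROR", "error": "ValidationError"}',
]

for error, count in count_errors(sample_logs).most_common():
    print(error, count)
```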


Summary

These questions cover fundamental AWS concepts across compute, storage, networking, security, serverless, databases, and monitoring. Practice explaining these concepts clearly and relate them to real-world scenarios you’ve encountered or designed.